The QBNews                                                     Page  4
     Volume  3, Number  2                                     June 29, 1992

     Indexing 101 by Richard Vannoy
     
     So here's the basics of file indexing.  I'll use the file extensions
     .DBF to indicate a normal RANDOM data file and .NDX to show an index
     file.  We'll start with a data file:
     
     TYPE EmployeeData
        lastname AS STRING * 15
        firstname AS STRING * 10
        employeeNumber AS INTEGER
        SSN AS STRING * 11
        salary AS SINGLE
        hireDate AS STRING * 8
        'Many more fields could (and usually do) go here for a
        'typical company data base.  Imafine that there are
        'perhaps 60 to 80 fields using maybe 400 to 600 bytes
        'each. It's important to remember, as we go along, that
        'ALL of this data for all the employees is much too
        'massive to fit in the memory of most machines.
     END TYPE
     DIM EMPL AS EmployeeData
     
     Now let's suppose the personnel department wants us to design the data
     base so they can quickly retrieve an employee's information, and they
     want to be able to enter EITHER the first and last name OR the
     employee number.  As you could guess, if you were sitting at a
     terminal all day, you would quickly find out that if you can enter
     just the employee number instead of the full name, that data entry
     will be much faster and more efficient.  So if we are dealing with
     documents that have the employee number (as most work type documents
     do), then that would be quicker.  But we also have to allow for the
     case where the employee number is not known.  So, we must also allow
     the employee's name to be entered also.
     
     So we decide to index the data base in two ways; first by employee
     number:
     
     We need a small data structure to contain just two things: the
     employee number (what we're indexing on), and the record number where
     that employee's information can be found.  So..
     
     TYPE EmpNumberIndex
        EmpNumber AS SINGLE
        RecordNumber AS SINGLE
     END TYPE
     DIM NUMB AS EmpNumberIndex
     
     Then we need a small data structure for the first name/Last name
     information:
     
     TYPE EmpNameIndex
        BothNames AS STRING * 26
        RecordNumber AS SINGLE
     END TYPE
     
     The QBNews                                                     Page  5
     Volume  3, Number  2                                     June 29, 1992

     DIM ENAM AS EmpNameIndex
     
     Note that the BothNames field is big enough to hold the first name, a
     space, and then the last name.  Also note that both EmpNumberIndex and
     EmpNameIndex have the RecordNumber for where that employee is stored,
     since we don't need ALL the employee info, just a pointer to where we
     can find it.
     
     We're going to open three files...
     
     OPEN "EMPLOYEE.DBF" FOR RANDOM AS #1 LEN = LEN(EMPL)
     OPEN "EMPNUMBR.NDX" FOR RANDOM AS #2 LEN = LEN(NUMB)
     OPEN "EMPLNAME.NDX" FOR RANDOM AS #3 LEN = LEN(ENAM)
     
     Let's start with three employees.
     EMPLOYEE.DBF will have:
     RECLast Name  First Name  Employee Numb.  AND all the------>
     -- ---------  ----------  --------------  rest of the------>
      1 Vannoy     Richard         46          fields for------->
      2 Que        Suzie           32          for several------>
      3 McGee      Bobbie          23          hundred bytes---->
     
     EMPNUMBR.NDX will have:
     Employee Numb.  REC
     --------------  ---
           46         1
           32         2
           23         3
     
     EMPLNAME.NDX will have:
     BothNames      REC
     ---------      ---
     Vannoy Richard  1
     Que Suzie       2
     McGee Bobbie    3
     
     The last thing we have to do before using this data base is to sort
     each .NDX file so we can use a binary search to quickly find what we
     want.
     
     EMPLNUMBR.NDX will now look like:
     Employee Numb.  REC
     --------------  ---
          23          3    (Sorted by Employee number)
          32          2
          46          1
     
     EMPLNAME.NDX will now look like:
     BothNames      REC
     ---------      ---
     McGee Bobbie    3     (Sorted by name)
     Que Suzie       2
     Vannoy Richard  1
     
     
     The QBNews                                                     Page  6
     Volume  3, Number  2                                     June 29, 1992

     So far it seems like a lot of work to do all this for just three
     records, but remember that we probably have hundreds, maybe even
     thousands of records, so be aware that the problem of NOT having to
     search thousands of records for a name OR an employee number (and BOTH
     can never be in the same order in the original .DBF file) is what got
     us here in the first place.
     
     In order to understand the speed of an indexing system, it is
     necessary to know how a binary search works, so we'll digress a moment
     to cover it.  The first requirement of a binary search is that the
     records we are searching MUST BE IN ORDER, either numerically or
     alphabetically, for the search to work.  That's why we sorted the .NDX
     files above.
     
     Let's say we are searching the EMPLNAME.NDX file for me (Vannoy
     Richard), all the records are in alphabetical order by name, there are
     1000 records and my name happens to be at record 687.
     
     First set LOW to 1 and HIGH to the number of records (1000).  the
     formula for our first LOOK is going to be (HIGH + LOW)\2 or 500.  LOOK
     at record #500.  Is the name there (alphabetically) greater than or
     less than MY name?  Well, since I'm at 687, the name at 500 has to be
     LESS.  We now readjust and recompute LOOK as folows:
     
     IF NameAt LOOK < MyName THEN
        LOW = 500
     ELSE
        HIGH = 500
     END IF
     NewLOOK = (HIGH + LOW)/2
     
     Notice, we have cut the search area IN HALF (thus the name BINARY) by
     looking at the file JUST ONCE.  And by refining our LOOK record to the
     middle of the new search area which has just been halved, it won't
     take long to find the right record.  The code looks like:
     
        found = 0
        low = 1
        high = numberOfRecords
        DO
           look = (high + low) \ 2  'Integer division
           GET #3, look, ENAM
           IF ENAM.BothNames < SearchingForName$ THEN
              low = look
           ELSEIF ENAM.BothNames > SearchingForName$ THEN
              high = look
           ELSE
              found = 1
              EXIT DO
           END IF
        LOOP UNTIL low = high
        IF found THEN
           'Process the record
        ELSE
     
     The QBNews                                                     Page  7
     Volume  3, Number  2                                     June 29, 1992

           PRINT "Record not found."
        END IF
     
     So eventually, one of two things has to happen, either the record is
     found (ENAM.BothNames = SearchName$) or, if the record is NOT in the
     data base, the HIGH and LOW numbers will run together (Thus the LOOP
     UNTIL low = high).
     
     Again, this may SEEM like a lot of trouble, because, say with 100
     records, the best case is that the record you want is at record #50 (1
     look!) but the WORST case is that it will take 7 looks, for an average
     of about four looks at 100 records to find someone.  (Remember that in
     a sequential file, the worst case could be 100 looks for an average of
     about 50 looks!)
     
     And the beauty of binary search is that YOU CAN DOUBLE THE NUMBER OF
     RECORDS AND ONLY INCREASE THE AVERAGE NUMBER OF LOOKS BY ONE!
     
     So, to sum up the binary search routine, lets look for an employee by
     number and name.
     
     IF input was a number THEN
        Binary search the EMPLNUMB.NDX for the number.
        If found, go to the indicated record in EMPLOYEE.DBF and
        display or process the data.
        If not found, print an error message.
     IF input was a string
        Binary search the EMPLNAME.NDX for the name.
        If found, got to the indicated record in EMPLOYEE.DBF and
        display or process the data.
        If not found, print an error message.
     
     Note that the logic for a number or a string is identical, so you
     write the code for a number looked up in EMPLNUMB.NDX, then make a
     copy of the same code and make the minor modifications of looking for
     a string in EMPLNAME.NDX.
     
     That's how it works!  Indexing may seem bulky or unnecessary at first,
     but I can never emphasize enough the increase in speed you get.  You
     will just have to try it or see it for yourself.
     
     The part of an indexing system that takes the most caution and care is
     the maintenance of the .NDX files.  Remember that whenever you add a
     record, delete a record, or change the data (field) that is in an
     index file, then THAT index file must be IMMEDIATELY updated with the
     correct information for the system to work.  It can be easy if you
     take it a step at a time.
     
     In our example, let's say you add an employee.  The new employee's
     data goes in the next available record of EMPLOYEE.DBF, say number
     1234.  His name is Jones Casey, and his employee number is 2345.  Now
     create a new record for EMPNUMBR.NDX
     
     EmpNumb.  Record Number
     
     The QBNews                                                     Page  8
     Volume  3, Number  2                                     June 29, 1992

       2345       1234
     
     Put this info at the end of the EMPNUMBR.NDX file and reverse bubble
     sort it up to where it belongs.  Do the same with
     
     EmpName     RecordNumber
     Jones Casey  1234
     
     in the EMPLNAME.NDX file.
     
     All maintenance must be done on the fly (right NOW!), because the very
     next operation may be to find the employee you just added, or
     otherwise process the record that was just manipulated, so do it now.
     
     That should acquaint you with the basic theory of index file creating
     and use.  The best way to fully understand the concepts is to set up
     an indexed file and try things out.  I suggest you set up a data base
     with no index, then add one index, and later one other to get you
     started.
     
     **********************************************************************
     Richard Vannoy started programming as a hobby back in 1975.  In 1980,
     he began his career as a computer programmer and teacher upon his
     retirement as a naval submarine officer.  His projects have included 
     
      . Writing a BASIC navigation program to assist in locating objects 
        of value lost on the bottom of the ocean.  
      . Maintaining an extensive technical manual data base on an Imsai 
        8080 in North Star BASIC.  
      . Maintaining a large government contractor accounting program on 
        a PDP-11 with Fortran.  
      . Various contracting/consulting jobs programming with dBASE III+.  
     
     Richard presently resides in Portland, Oregon where he writes software
     for local businesses in QuickBASIC.
     **********************************************************************