The QBNews Page 4 Volume 3, Number 2 June 29, 1992 Indexing 101 by Richard Vannoy So here's the basics of file indexing. I'll use the file extensions .DBF to indicate a normal RANDOM data file and .NDX to show an index file. We'll start with a data file: TYPE EmployeeData lastname AS STRING * 15 firstname AS STRING * 10 employeeNumber AS INTEGER SSN AS STRING * 11 salary AS SINGLE hireDate AS STRING * 8 'Many more fields could (and usually do) go here for a 'typical company data base. Imafine that there are 'perhaps 60 to 80 fields using maybe 400 to 600 bytes 'each. It's important to remember, as we go along, that 'ALL of this data for all the employees is much too 'massive to fit in the memory of most machines. END TYPE DIM EMPL AS EmployeeData Now let's suppose the personnel department wants us to design the data base so they can quickly retrieve an employee's information, and they want to be able to enter EITHER the first and last name OR the employee number. As you could guess, if you were sitting at a terminal all day, you would quickly find out that if you can enter just the employee number instead of the full name, that data entry will be much faster and more efficient. So if we are dealing with documents that have the employee number (as most work type documents do), then that would be quicker. But we also have to allow for the case where the employee number is not known. So, we must also allow the employee's name to be entered also. So we decide to index the data base in two ways; first by employee number: We need a small data structure to contain just two things: the employee number (what we're indexing on), and the record number where that employee's information can be found. So.. TYPE EmpNumberIndex EmpNumber AS SINGLE RecordNumber AS SINGLE END TYPE DIM NUMB AS EmpNumberIndex Then we need a small data structure for the first name/Last name information: TYPE EmpNameIndex BothNames AS STRING * 26 RecordNumber AS SINGLE END TYPE The QBNews Page 5 Volume 3, Number 2 June 29, 1992 DIM ENAM AS EmpNameIndex Note that the BothNames field is big enough to hold the first name, a space, and then the last name. Also note that both EmpNumberIndex and EmpNameIndex have the RecordNumber for where that employee is stored, since we don't need ALL the employee info, just a pointer to where we can find it. We're going to open three files... OPEN "EMPLOYEE.DBF" FOR RANDOM AS #1 LEN = LEN(EMPL) OPEN "EMPNUMBR.NDX" FOR RANDOM AS #2 LEN = LEN(NUMB) OPEN "EMPLNAME.NDX" FOR RANDOM AS #3 LEN = LEN(ENAM) Let's start with three employees. EMPLOYEE.DBF will have: RECLast Name First Name Employee Numb. AND all the------> -- --------- ---------- -------------- rest of the------> 1 Vannoy Richard 46 fields for-------> 2 Que Suzie 32 for several------> 3 McGee Bobbie 23 hundred bytes----> EMPNUMBR.NDX will have: Employee Numb. REC -------------- --- 46 1 32 2 23 3 EMPLNAME.NDX will have: BothNames REC --------- --- Vannoy Richard 1 Que Suzie 2 McGee Bobbie 3 The last thing we have to do before using this data base is to sort each .NDX file so we can use a binary search to quickly find what we want. EMPLNUMBR.NDX will now look like: Employee Numb. REC -------------- --- 23 3 (Sorted by Employee number) 32 2 46 1 EMPLNAME.NDX will now look like: BothNames REC --------- --- McGee Bobbie 3 (Sorted by name) Que Suzie 2 Vannoy Richard 1 The QBNews Page 6 Volume 3, Number 2 June 29, 1992 So far it seems like a lot of work to do all this for just three records, but remember that we probably have hundreds, maybe even thousands of records, so be aware that the problem of NOT having to search thousands of records for a name OR an employee number (and BOTH can never be in the same order in the original .DBF file) is what got us here in the first place. In order to understand the speed of an indexing system, it is necessary to know how a binary search works, so we'll digress a moment to cover it. The first requirement of a binary search is that the records we are searching MUST BE IN ORDER, either numerically or alphabetically, for the search to work. That's why we sorted the .NDX files above. Let's say we are searching the EMPLNAME.NDX file for me (Vannoy Richard), all the records are in alphabetical order by name, there are 1000 records and my name happens to be at record 687. First set LOW to 1 and HIGH to the number of records (1000). the formula for our first LOOK is going to be (HIGH + LOW)\2 or 500. LOOK at record #500. Is the name there (alphabetically) greater than or less than MY name? Well, since I'm at 687, the name at 500 has to be LESS. We now readjust and recompute LOOK as folows: IF NameAt LOOK < MyName THEN LOW = 500 ELSE HIGH = 500 END IF NewLOOK = (HIGH + LOW)/2 Notice, we have cut the search area IN HALF (thus the name BINARY) by looking at the file JUST ONCE. And by refining our LOOK record to the middle of the new search area which has just been halved, it won't take long to find the right record. The code looks like: found = 0 low = 1 high = numberOfRecords DO look = (high + low) \ 2 'Integer division GET #3, look, ENAM IF ENAM.BothNames < SearchingForName$ THEN low = look ELSEIF ENAM.BothNames > SearchingForName$ THEN high = look ELSE found = 1 EXIT DO END IF LOOP UNTIL low = high IF found THEN 'Process the record ELSE The QBNews Page 7 Volume 3, Number 2 June 29, 1992 PRINT "Record not found." END IF So eventually, one of two things has to happen, either the record is found (ENAM.BothNames = SearchName$) or, if the record is NOT in the data base, the HIGH and LOW numbers will run together (Thus the LOOP UNTIL low = high). Again, this may SEEM like a lot of trouble, because, say with 100 records, the best case is that the record you want is at record #50 (1 look!) but the WORST case is that it will take 7 looks, for an average of about four looks at 100 records to find someone. (Remember that in a sequential file, the worst case could be 100 looks for an average of about 50 looks!) And the beauty of binary search is that YOU CAN DOUBLE THE NUMBER OF RECORDS AND ONLY INCREASE THE AVERAGE NUMBER OF LOOKS BY ONE! So, to sum up the binary search routine, lets look for an employee by number and name. IF input was a number THEN Binary search the EMPLNUMB.NDX for the number. If found, go to the indicated record in EMPLOYEE.DBF and display or process the data. If not found, print an error message. IF input was a string Binary search the EMPLNAME.NDX for the name. If found, got to the indicated record in EMPLOYEE.DBF and display or process the data. If not found, print an error message. Note that the logic for a number or a string is identical, so you write the code for a number looked up in EMPLNUMB.NDX, then make a copy of the same code and make the minor modifications of looking for a string in EMPLNAME.NDX. That's how it works! Indexing may seem bulky or unnecessary at first, but I can never emphasize enough the increase in speed you get. You will just have to try it or see it for yourself. The part of an indexing system that takes the most caution and care is the maintenance of the .NDX files. Remember that whenever you add a record, delete a record, or change the data (field) that is in an index file, then THAT index file must be IMMEDIATELY updated with the correct information for the system to work. It can be easy if you take it a step at a time. In our example, let's say you add an employee. The new employee's data goes in the next available record of EMPLOYEE.DBF, say number 1234. His name is Jones Casey, and his employee number is 2345. Now create a new record for EMPNUMBR.NDX EmpNumb. Record Number The QBNews Page 8 Volume 3, Number 2 June 29, 1992 2345 1234 Put this info at the end of the EMPNUMBR.NDX file and reverse bubble sort it up to where it belongs. Do the same with EmpName RecordNumber Jones Casey 1234 in the EMPLNAME.NDX file. All maintenance must be done on the fly (right NOW!), because the very next operation may be to find the employee you just added, or otherwise process the record that was just manipulated, so do it now. That should acquaint you with the basic theory of index file creating and use. The best way to fully understand the concepts is to set up an indexed file and try things out. I suggest you set up a data base with no index, then add one index, and later one other to get you started. ********************************************************************** Richard Vannoy started programming as a hobby back in 1975. In 1980, he began his career as a computer programmer and teacher upon his retirement as a naval submarine officer. His projects have included . Writing a BASIC navigation program to assist in locating objects of value lost on the bottom of the ocean. . Maintaining an extensive technical manual data base on an Imsai 8080 in North Star BASIC. . Maintaining a large government contractor accounting program on a PDP-11 with Fortran. . Various contracting/consulting jobs programming with dBASE III+. Richard presently resides in Portland, Oregon where he writes software for local businesses in QuickBASIC. **********************************************************************