QB CULT MAGAZINE
Vol. 2 Iss. 3 - August 2001

BASIC Techniques and Utilities, Chapter 8
Sorting and Searching

Two fundamental operations required of many applications are searching and sorting the data they operate on. Many different types of data are commonly sorted, such as customer names, payment due dates, or even a list of file names displayed in a file selection menu. If you are writing a programmer's cross reference utility, you may need to sort a list of variable names without regard to capitalization. In some cases, you may want to sort several pieces of related information based on the contents of only one of them. One example of that is a list of names and addresses sorted in ascending zip code order.

Searching is equally important; for example, to locate a customer name in an array or disk file. In some cases you may wish to search for a complete match, while in others a partial match is needed. If you are searching a list of names for, say, Leonard, you probably would want to ignore Leonardo. But when searching a list of zip codes you may need to locate all that begin with the digits 068. There are many different ways sorting and searching can be accomplished, and the subject is by no means a simple one.

Most programmers are familiar with the Bubble Sort, because it is the simplest to understand. Each adjacent pair of items is compared, and then exchanged if they are out of order. This process is repeated over and over, until the entire list has been examined as many times as there are items. Unfortunately, these repeated comparisons make the Bubble Sort an extremely poor performer. Similarly, code to perform a linear search that simply examines each item in succession for a match is easy to grasp, but it will be painfully slow when there are many items.

In this chapter you will learn how sophisticated algorithms that handle these important programming chores operate. You will also learn how to sort data on more than one key. Often, it is not sufficient to merely sort a list of customers by their last name. For example, you may be expected to sort first by last name, then by first name, and finally by balance due. That is, all of the last names would first be sorted. Then within all of the Smiths you would sort again by first name, and for all of the John Smiths sort that subgroup based on how much money is owed.

For completeness I will start each section by introducing sorting and searching methods that are easy to understand, and then progress to the more complex algorithms that are much more effective. Specifically, I will show the Quick Sort and Binary Search algorithms. When there are many thousands of data items, a good algorithm can make the difference between a sort routine that takes ten minutes to complete, and one that needs only a few seconds.

Finally, I will discuss both BASIC and assembly language sort routines. As important as the right algorithm is for good performance, an assembly language implementation will be even faster. Chapter 12 describes how assembly language routines are written and how they work, and in this chapter I will merely show how to use the routines included with this book.

Sorting Fundamentals

Although there are many different ways to sort an array, the simplest sorting algorithm is the Bubble Sort. The name Bubble is used because a FOR/NEXT loop repeatedly examines each adjacent pair of elements in the array, and those that have higher values rise to the top like bubbles in a bathtub. The most common type of sort is ascending, which means that "A" comes before "B", which comes before "C", and so forth. Figure 8-1 shows how the name Zorba ascends to the top of a five-item list of first names.

Initial array contents:

  Element 4    Kathy
  Element 3    Barbara
  Element 2    Cathy
  Element 1    Zorba <


      After 1 pass:

        Element 4    Kathy
        Element 3    Barbara
        Element 2    Zorba <
        Element 1    Cathy


            After 2 passes:

              Element 4    Kathy
              Element 3    Zorba <
              Element 2    Barbara
              Element 1    Cathy


                After 3 passes:

                  Element 4    Zorba <
                  Element 3    Kathy
                  Element 2    Barbara
                  Element 1    Cathy

Figure 8.1: Data ascending a list during a bubble sort.

The Bubble Sort routine that follows uses a FOR/NEXT loop to repeatedly examine an array and exchange elements as necessary, until all of the items are in the correct order.

DEFINT A-Z
DECLARE SUB BubbleSort (Array$())

CONST NumItems% = 20
CONST False% = 0
CONST True% = -1

DIM Array$(1 TO NumItems%)
FOR X = 1 TO NumItems%
  READ Array$(X)
NEXT

CALL BubbleSort(Array$())

CLS
FOR X = 1 TO NumItems%
  PRINT Array$(X)
NEXT

DATA Zorba, Cathy, Barbara, Kathy, Josephine
DATA Joseph, Joe, Peter, Arnold, Glen
DATA Ralph, Elli, Lucky, Rocky, Louis
DATA Paula, Paul, Mary Lou, Marilyn, Keith
END

SUB BubbleSort (Array$()) STATIC

DO
  OutOfOrder = False%                 'assume it's sorted
  FOR X = 1 TO UBOUND(Array$) - 1
    IF Array$(X) > Array$(X + 1) THEN
      SWAP Array$(X), Array$(X + 1)   'if we had to swap
      OutOfOrder = True%              'we may not be done
    END IF
  NEXT
LOOP WHILE OutOfOrder

END SUB

This routine is simple enough to be self-explanatory, and only a few things warrant discussing. One is the OutOfOrder flag variable. When the array is nearly sorted to begin with, fewer passes through the loop are needed. The OutOfOrder variable determines when no more passes are necessary. It is cleared at the start of each loop, and set each time two elements are exchanged. If, after examining all of the elements in one pass no exchanges were required, then the sorting is done and there's no need for the DO loop to continue.

The other item worth mentioning is that the FOR/NEXT loop is set to consider one element less than the array actually holds. This is necessary because each element is compared to the one above it. If the last element were included in the loop, then BASIC would issue a "Subscript out of range" error on the statement that examines Array$(X + 1).

There are a number of features you can add to this Bubble Sort routine. For example, you could sort without regard to capitalization. In that case "adams" would come before "BAKER", even though the lowercase letter "a" has a higher ASCII value than the uppercase letter "B". To add that capability simply use BASIC's UCASE$ (or LCASE$) function as part of the comparisons:

   IF UCASE$(Array$(X)) > UCASE$(Array$(X + 1)) THEN

And to sort based on the eight-character portion that starts six bytes into each string you would use this:

   IF MID$(Array$(X), 5, 8) > MID$(Array$(X + 1), 5, 8) THEN

Although the comparisons in this example are based on just a portion of each string, the SWAP statement must exchange the entire elements. This opens up many possibilities as you will see later in this chapter.

If there is a chance that the strings may contain trailing blanks that should be ignored, you can use RTRIM$ on each pair of elements:

   IF RTRIM$(Array$(X)) > RTRIM$(Array$(X + 1)) THEN

Of course, you can easily combine these enhancements to consider only the characters in the middle after they have been converted to upper or lower case.

Sorting in reverse (descending) order is equally easy; you'd simply replace the greater-than symbol (>) with a less-than symbol (<).

Finally, you can modify the routine to work with any type of data by changing the array type identifier. That is, for every occurrence of Array$ you will change that to Array% or Array# or whatever is appropriate. If you are sorting a numeric array, then different modifications may be in order. For example, to sort ignoring whether the numbers are positive or negative you would use BASIC's ABS (absolute value) function:

   IF ABS(Array!(X)) > ABS(Array!(X + 1)) THEN

It is important to point out that all of the simple modifications described here can also be applied to the more sophisticated sort routines we will look at later in this chapter.

Indexed Sorts

Besides the traditional sorting methods--whether a Bubble Sort or Quick Sort or any other type of sort--there is another category of sort routine you should be familiar with. Where a conventional sort exchanges elements in an array until they are in order, an Index Sort instead exchanges elements in a parallel numeric array of *pointers*. The original data is left intact, so it may still be accessed in its natural order. However, the array can also be accessed in sorted order by using the element numbers contained in the index array.

As with a conventional sort, the comparisons in an indexed sort routine examine each element in the primary array, but based on the element numbers in that index array. If it is determined that the data is out of order, the routine exchanges the elements in the index array instead of the primary array. A modification to the Bubble Sort routine to sort using an index is shown below.

DEFINT A-Z
DECLARE SUB BubbleISort (Array$(), Index())

CONST NumItems% = 20
CONST False% = 0
CONST True% = -1

DIM Array$(1 TO NumItems%)  'this holds the string data
DIM Ndx(1 TO NumItems%)     'this holds the index

FOR X = 1 TO NumItems%
  READ Array$(X)            'read the string data
  Ndx(X) = X                'initialize the index array
NEXT

CALL BubbleISort(Array$(), Ndx())

CLS
FOR X = 1 TO NumItems%
  PRINT Array$(Ndx(X))      'print based on the index
NEXT

DATA Zorba, Cathy, Barbara, Kathy, Josephine
DATA Joseph, Joe, Peter, Arnold, Glen
DATA Ralph, Elli, Lucky, Rocky, Louis
DATA Paula, Paul, Mary lou, Marilyn, Keith

SUB BubbleISort (Array$(), Index()) STATIC

DO
  OutOfOrder = False%                 'assume it's sorted
  FOR X = 1 TO UBOUND(Array$) - 1
    IF Array$(Index(X)) > Array$(Index(X + 1)) THEN
      SWAP Index(X), Index(X + 1)     'if we had to swap
      OutOfOrder% = True%             'we're not done yet
    END IF
  NEXT
LOOP WHILE OutOfOrder%

END SUB

In this indexed sort, all references to the data are through the index array. And when a swap is necessary, it is the index array elements that are exchanged. Note that an indexed sort requires that the index array be initialized to increasing values--even if the sort routine is modified to be descending instead of ascending. Therefore, when BubbleISort is called Ndx(1) must hold the value 1, Ndx(2) is set to 2, and so forth.

In this example the index array is initialized by the caller. However, it would be just as easy to put that code into the subprogram itself. Since you can't pass an array that hasn't yet been dimensioned, it makes the most sense to do both steps outside of the subprogram. Either way, the index array must be assigned to these initial values.

As I mentioned earlier, one feature of an indexed sort is that it lets you access the data in both its original and sorted order. But there are other advantages, and a disadvantage as well. The disadvantage is that each comparison takes slightly longer, because of the additional overhead required to first look up the element number in the index array, to determine which elements in the primary array will be compared. In some cases, though, that can be more than offset by requiring less time to exchange elements.

If you are sorting an array of 230-byte TYPE variables, the time needed for SWAP to exchange the elements can become considerable. Every byte in both elements must be read and written, so the time needed increases linearly as the array elements become longer. Contrast that with the fixed two bytes in the integer index array that are swapped.

Another advantage of an indexed sort is that it lends itself to sorting more data than can fit in memory. As you will see later in the section that shows how to sort files, it is far easier to manipulate an integer index than an entire file. Further, sorting the file data using multiple passes requires twice as much disk space as the file already occupies.

Data Manipulation Techniques

Before I show the Quick Sort algorithm that will be used as a basis for the remaining sort examples in this chapter, you should also be aware of a few simple tricks that can help you maintain and sort your data. One was described in Chapter 6, using a pair of functions that pack and unpack dates such that the year is stored before the month, which in turn is before the day. Thus, date strings are reduced to only three characters each, and they can be sorted directly.

Another useful speed-up trick is to store string data as integers or long integers. If you had a system of four-digit account numbers you could use an integer instead of a string. Besides saving half the memory and disk space, the integer comparisons in a sort routine will be many times faster than a comparison on string equivalents. Zip codes are also suited to this, and could be stored in a long integer. Even though the space savings is only one byte, the time needed to compare the values for sorting will be greatly reduced.

This brings up another important point. As you learned in Chapter 2, all conventional (not fixed-length) strings require more memory than might be immediately apparent. Besides the amount of memory needed to hold the data itself, four additional bytes are used for a string descriptor, and two more beyond those for a back pointer. Therefore, a zip code stored as a string will actually require eleven bytes rather than the five you might expect. With this in mind, you may be tempted to think that using a fixed- length string to hold the zip codes will solve the problem. Since fixed- length strings do not use either descriptors or back pointers, they do not need the memory they occupy. And that leads to yet another issue.

Whenever a fixed-length string or the string portion of a TYPE variable is compared, it must first be converted to a regular descriptored string. BASIC has only one string comparison routine, and it expects the addresses for two conventional string descriptors. Every time a fixed-length string is used as an argument for comparison, BASIC must create a temporary copy, call its comparison routine, and then delete the copy. This copying adds code and wastes an enormous amount of time; in many cases the copying will take longer than the comparison itself. Therefore, using integers and long integers for numeric data where possible will provide more improvement than just the savings in memory use.

In some cases, however, you must use fixed-length string or TYPE arrays. In particular, when sorting information from a random access disk file it is most sensible to load the records into a TYPE array. And as you learned in Chapter 2, the string components of a TYPE variable or array element are handled by BASIC as a fixed-length string. So how can you effectively sort fixed-length string arrays without incurring the penalty BASIC's overhead imposes? With assembly language subroutines, of course!

Rather than ask BASIC to pass the data using its normal methods, assembly language routines can be invoked passing the data segments and addresses directly. When you use SEG, or a combination of VARSEG and VARPTR with fixed-length and TYPE variables, BASIC knows that you want the segmented address of the variable or array element. Thus, you are tricking BASIC into not making a copy as it usually would when passing such data. An assembly language subroutine or function can be designed to accept data addresses in any number of ways. As you will see later when we discuss sorting on multiple keys, extra trickery is needed to do the same thing in a BASIC procedure.

The three short assembly language functions that follow compare two portions of memory, and then return a result that can be tested by your program.

;COMPARE.ASM - compares two ranges of memory

.Model Medium, Basic
.Code

Compare Proc Uses DS ES DI SI, SegAdr1:DWord, _
  SegAdr2:DWord, NumBytes:Word

    Cld                ;compare in the forward direction
    Mov  SI,NumBytes   ;get the address for NumBytes%
    Mov  CX,[SI]       ;put it into CX for comparing below

    Les  DI,SegAdr1    ;load ES:DI with the first
                       ;  segmented address
    Lds  SI,SegAdr2    ;load DS:SI with the second
                       ;  segmented address

    Repe Cmpsb         ;do the compare
    Mov  AX,0          ;assume the bytes didn't match
    Jne  Exit          ;we were right, skip over
    Dec  AX            ;wrong, decrement AX down to -1

Exit:
    Ret                ;return to BASIC

Compare Endp
End

;COMPARE2.ASM - compares memory case-insensitive

.Model Medium, Basic
.Code

Compare2 Proc Uses DS ES DI SI, SegAdr1:DWord, _
  SegAdr2:DWord, NumBytes:Word

    Cld                ;compare in the forward direction
    Mov  BX,-1         ;assume the ranges are the same

    Mov  SI,NumBytes   ;get the address for NumBytes%
    Mov  CX,[SI]       ;put it into CX for comparing below
    Jcxz Exit          ;if zero bytes were given, they're
                       ;  the same
    Les  DI,SegAdr1    ;load ES:DI with the first address
    Lds  SI,SegAdr2    ;load DS:SI with the second address

Do:
    Lodsb              ;load the current character from
                       ;  DS:SI into AL
    Call Upper         ;capitalize as necessary
    Mov  AH,AL         ;copy the character to AH
    
    Mov  AL,ES:[DI]    ;load the other character into AL
    Inc  DI            ;point at the next one for later
    Call Upper         ;capitalize as necessary

    Cmp  AL,AH         ;now, are they the same?
    Jne  False         ;no, exit now and show that
    Loop Do            ;yes, continue
    Jmp  Short Exit    ;if we get this far, the bytes are
                       ;  all the same
False:
    Inc  BX            ;increment BX to return zero
    
Exit:
    Mov  AX,BX         ;assign the function output
    Ret                ;return to BASIC

Upper:
    Cmp  AL,"a"        ;is the character below an "a"?
    Jb   Done          ;yes, so we can skip it
    Cmp  AL,"z"        ;is the character above a "z"?
    Ja   Done          ;yes, so we can skip that too
    Sub  AL,32         ;convert to upper case

Done:
    Retn               ;do a near return to the caller

Compare2 Endp
End

;COMPARE3.ASM - case-insensitive, greater/less than

.Model Medium, Basic
.Code

Compare3 Proc Uses DS ES DI SI, SegAdr1:DWord, _
  SegAdr2:DWord, NumBytes:Word

    Cld               ;compare in the forward direction
    Xor  BX,BX        ;assume the ranges are the same

    Mov  SI,NumBytes  ;get the address for NumBytes%
    Mov  CX,[SI]      ;put it into CX for comparing below
    Jcxz Exit         ;if zero bytes were given, they're
                      ;  the same
    Les  DI,SegAdr1   ;load ES:DI with the first address
    Lds  SI,SegAdr2   ;load DS:SI with the second address

Do:
    Lodsb             ;load the current character from
                      ;  DS:SI into AL
    Call Upper        ;capitalize as necessary, remove for
                      ;  case-sensitive
    Mov  AH,AL        ;copy the character to AH

    Mov  AL,ES:[DI]   ;load the other character into AL
    Inc  DI           ;point at the next character for later
    Call Upper        ;capitalize as necessary, remove for
                      ;  case-sensitive

    Cmp  AL,AH        ;now, are they the same?
    Loope Do          ;yes, continue
    Je   Exit         ;we exhausted the data and they're
                      ;  the same
    Mov  BL,1         ;assume block 1 was "greater"
    Ja   Exit         ;we assumed correctly
    Dec  BX           ;wrong, bump BX down to -1
    Dec  BX

Exit:
    Mov  AX,BX        ;assign the function output
    Ret               ;return to BASIC

Upper:
    Cmp  AL,"a"       ;is the character below an "a"?
    Jb   Done         ;yes, so we can skip it
    Cmp  AL,"z"       ;is the character above a "z"?
    Ja   Done         ;yes, so we can skip that too
    Sub  AL,32        ;convert to upper case

Done:
    Retn              ;do a near return to the caller

Compare3 Endp
End

The first Compare routine above simply checks if all of the bytes are identical, and returns -1 (True) if they are, or 0 (False) if they are not. By returning -1 or 0 you can use either

   IF Compare%(Type1, Type2, NumBytes%) THEN

   IF NOT Compare%(Type1, Type2, NumBytes%) THEN

depending on which logic is clearer for your program. Compare2 is similar to Compare, except it ignores capitalization. That is, "SMITH" and Smith" are considered equal. The Compare3 function also compares memory and ignores capitalization, but it returns either -1, 0, or 1 to indicate if the first data range is less than, equal to, or greater than the second.

The correct declaration and usage for each of these routines is shown below. Note that Compare and Compare2 are declared and used in the same fashion.

Compare and Compare2:

   DECLARE FUNCTION Compare%(SEG Type1 AS ANY, SEG Type2 AS ANY, _
     NumBytes%)
   Same = Compare%(Type1, Type2, NumBytes%)

   DECLARE FUNCTION Compare%(BYVAL Seg1%, BYVAL Adr1%, BYVAL Seg2%, _
     BYVAL Adr2%, NumBytes%)
   Same = Compare%(Seg1%, Adr1%, Seg2%, Adr2%, NumBytes%)

Here, Same receives -1 if the two TYPE variables or ranges of memory are the same, or 0 if they are not. NumBytes% tells how many bytes to compare.

Compare3:

   DECLARE FUNCTION Compare3%(SEG Type1 AS ANY, SEG Type2 AS ANY, _
     NumBytes%)
   Result = Compare3%(Type1, Type2, NumBytes%)

   DECLARE FUNCTION Compare3%(BYVAL Seg1%, BYVAL Adr1%, BYVAL Seg2%, _
     BYVAL Adr2%, NumBytes%)
   Result = Compare3%(Seg1%, Adr1%, Seg2%, Adr2%, NumBytes%)

Result receives 0 if the two type variables or ranges of memory are the same, -1 if the first is less when compared as strings, or 1 if the first is greater. NumBytes% tells how many bytes are to be to compared. In the context of a sort routine you could invoke Compare3 like this:

IF Compare3%(TypeEl(X), TypeEl(X + 1), NumBytes%) = 1 THEN
  SWAP TypeEl(X), TypeEl(X + 1)
END IF

As you can see, these routines may be declared in either of two ways. When used with TYPE arrays the first is more appropriate and results in slightly less setup code being generated by the compiler. When comparing fixed-length strings or arbitrary blocks of memory (for example, when one of the ranges is on the display screen) you should use the second method. Since SEG does not work correctly with fixed-length strings, if you want to use that more efficient version you must create a dummy TYPE comprised solely of a single string portion:

TYPE FixedLength
  Something AS STRING * 35
END TYPE

Then simply use DIM to create a single variable or an array based on this or a similar TYPE, depending on what your program needs. The requirement to create a dummy TYPE was discussed in Chapter 2, and I won't belabor the reasons again here. These comparison routines will be used extensively in the sort routines presented later in this chapter; however, their value in other, non-sorting situations should also be apparent.

Although these routines are written in assembly language, they are fairly simple to follow. It is important to understand that you do not need to know anything about assembly language to use them. All of the files you need to add these and all of the other routines in this book are contained on the accompanying diskette [here, in the same ZIP file as this text]. Chapter 12 discusses assembly language in great detail, and you can refer there for further explanation of the instructions used.

If you plan to run the programs that follow in the QuickBASIC editor, you must load the BASIC.QLB Quick Library as follows:

   qb program /l basic

Later when you compile these or other programs you must link with the parallel BASIC.LIB file:

   bc program [/o];
   link program , , nul , basic;

If you are using BASIC PDS start QBX using the BASIC7.QLB file, and then link with BASIC7.LIB to produce a stand-alone .EXE program. [VB/DOS users will also use the BASIC7 version.

The Quick Sort Algorithm

It should be obvious to you by now that a routine written in assembly language will always be faster than an equivalent written in BASIC. However, simply translating a procedure to assembly language is not always the best solution. Far more important than which language you use is selecting an appropriate algorithm. The best sorting method I know is the Quick Sort, and a well-written version of Quick Sort using BASIC will be many times faster than an assembly language implementation of the Bubble Sort.

The main problem with the Bubble Sort is that the number of comparisons required grows exponentially as the number of elements increases. Since each pass through the array exchanges only a few elements, many passes are required before the entire array is sorted. The Quick Sort was developed by C.A.R. (Tony) Hoare, and is widely recognized as the fastest algorithm available. In some special cases, such as when the data is already sorted or nearly sorted, the Quick Sort may be slightly slower than other methods. But in most situations, a Quick Sort is many times faster than any other sorting algorithm.

As with the Bubble Sort, there are many different variations on how a Quick Sort may be coded. (You may have noticed that the Bubble Sort shown in Chapter 7 used a nested FOR/NEXT loop, while the one shown here uses a FOR/NEXT loop within a DO/WHILE loop.) A Quick Sort divides the array into sections--sometimes called partitions--and then sorts each section individually. Many implementations therefore use recursion to invoke the subprogram from within itself, as each new section is about to be sorted. However, recursive procedures in any language are notoriously slow, and also consume stack memory at an alarming rate.

The Quick Sort version presented here avoids recursion, and instead uses a local array as a form of stack. This array stores the upper and lower bounds showing which section of the array is currently being considered. Another refinement I have added is to avoid making a copy of elements in the array. As a Quick Sort progresses, it examines one element selected arbitrarily from the middle of the array, and compares it to the elements that lie above and below it. To avoid assigning a temporary copy this version simply keeps track of the selected element number.

When sorting numeric data, maintaining a copy of the element is reasonable. But when sorting strings--especially strings whose length is not known ahead of time--the time and memory required to keep a copy can become problematic. For clarity, the generic Quick Sort shown below uses the copy method. Although this version is meant for sorting a single precision array, it can easily be adapted to sort any type of data by simply changing all instances of the "!" type declaration character.

'******** QSORT.BAS, Quick Sort algorithm demonstration

'Copyright (c) 1991 Ethan Winer

DEFINT A-Z
DECLARE SUB QSort (Array!(), StartEl, NumEls)

RANDOMIZE TIMER         'generate a new series each run

DIM Array!(1 TO 21)             'create an array
FOR X = 1 TO 21                 'fill with random numbers
  Array!(X) = RND(1) * 500      'between 0 and 500
NEXT

FirstEl = 6                     'sort starting here
NumEls = 10                     'sort this many elements

CLS
PRINT "Before Sorting:"; TAB(31); "After sorting:"
PRINT "==============="; TAB(31); "=============="

FOR X = 1 TO 21                 'show them before sorting
  IF X >= FirstEl AND X <= FirstEl + NumEls - 1 THEN
    PRINT "==>";
  END IF
  PRINT TAB(5); USING "###.##"; Array!(X)
NEXT

CALL QSort(Array!(), FirstEl, NumEls)

LOCATE 3
FOR X = 1 TO 21                 'print them after sorting
  LOCATE , 30
  IF X >= FirstEl AND X <= FirstEl + NumEls - 1 THEN
    PRINT "==>";                'point to sorted items
  END IF
  LOCATE , 35
  PRINT USING "###.##"; Array!(X)
NEXT

SUB QSort (Array!(), StartEl, NumEls) STATIC

REDIM QStack(NumEls \ 5 + 10)  'create a stack array

First = StartEl                'initialize work variables
Last = StartEl + NumEls - 1

DO
  DO
    Temp! = Array!((Last + First) \ 2)  'seek midpoint
    I = First
    J = Last

    DO     'reverse both < and > below to sort descending
      WHILE Array!(I) < Temp!
        I = I + 1
      WEND
      WHILE Array!(J) > Temp!
        J = J - 1
      WEND
      IF I > J THEN EXIT DO
      IF I < J THEN SWAP Array!(I), Array!(J)
      I = I + 1
      J = J - 1
    LOOP WHILE I <= J

    IF I < Last THEN
      QStack(StackPtr) = I              'Push I
      QStack(StackPtr + 1) = Last       'Push Last
      StackPtr = StackPtr + 2
    END IF

    Last = J
  LOOP WHILE First < Last

  IF StackPtr = 0 THEN EXIT DO          'Done
  StackPtr = StackPtr - 2
  First = QStack(StackPtr)              'Pop First
  Last = QStack(StackPtr + 1)           'Pop Last
LOOP

ERASE QStack               'delete the stack array

END SUB

Notice that I have designed this routine to allow sorting only a portion of the array. To sort the entire array you'd simply omit the StartEl and NumEls parameters, and assign First and Last from the LBOUND and UBOUND element numbers. That is, you will change these:

   First = StartEl

and

   Last = StartEl + NumEls - 1

to these:

   First = LBOUND(Array!)

and

   Last = UBOUND(Array!)

As I mentioned earlier, the QStack array serves as a table of element numbers that reflect which range of elements is currently being considered. You will need to dimension this array to one element for every five elements in the primary array being sorted, plus a few extra for good measure. In this program I added ten elements, because one stack element for every five main array elements is not enough for very small arrays. For data arrays that have a large amount of duplicated items, you will probably need to increase the size of the stack array.

Note that this ratio is not an absolute--the exact size of the stack that is needed depends on how badly out of order the data is to begin with. Although it is possible that one stack element for every five in the main array is insufficient in a given situation, I have never seen this formula fail. Because the stack is a dynamic integer array that is stored in far memory, it will not impinge on near string memory. If this routine were designed using the normal recursive method, BASIC's stack would be used which is in near memory.

Each of the innermost DO loops searches the array for the first element in each section about the midpoint that belongs in the other section. If the elements are indeed out of order (when I is less than J) the elements are exchanged. This incrementing and comparing continues until I and J cross. At that point, assuming the variable I has not exceeded the upper limits of the current partition, the partition bounds are saved and Last is assigned to the top of the next inner partition level. When the entire partition has been processed, the previous bounds are retrieved, but as a new set of First and Last values. This process continues until no more partition boundaries are on the stack. At that point the entire array is sorted.

On the accompanying disk you will find a program called SEEQSORT.BAS that contains an enhanced version of the QSort demo and subprogram. This program lets you watch the progress of the comparisons and exchanges as they are made, and actually see this complex algorithm operate. Simply load SEEQSORT.BAS into the BASIC editor and run it. A constant named Delay! is defined at the beginning of the program. Increasing its value makes the program run more slowly; decreasing it causes the program to run faster.

An Assembly Language Quick Sort

As fast as the BASIC QuickSort routine is, we can make it even faster. The listing below shows an assembly language version that is between ten and twenty percent faster, depending on which compiler you are using and if the BASIC PDS /fs (far strings) option is in effect.

;SORT.ASM - sorts an entire BASIC string array

.Model Medium, Basic
.Data
    S          DW 0
    F          DW 0
    L          DW 0
    I          DW 0
    J          DW 0
    MidPoint   DW 0

.Code
    Extrn B$SWSD:Proc   ;this swaps two strings
    Extrn B$SCMP:Proc   ;this compares two strings

Sort Proc Uses SI DI ES, Array:Word, Dir:Word

    Cld                 ;all fills and compares are forward
    Push DS             ;set ES = DS for string compares
    Pop  ES
    
    Xor  CX,CX          ;clear CX
    Mov  AX,7376h       ;load AL and AH with the opcodes
                        ;  Jae and Jbe in preparation for
                        ;  code self-modification
    Mov  BX,Dir         ;get the sorting direction
    Cmp  [BX],CX        ;is it zero (ascending sort)?
    Je   Ascending      ;yes, skip ahead
    Xchg AL,AH          ;no exchange the opcodes

Ascending:
    Mov  CS:[X1],AH     ;install correct comparison opcodes
    Mov  CS:[X2],AL     ;  based on the sort direction

    Mov  BX,Array       ;load the array descriptor address
    Mov  AX,[BX+0Eh]    ;save the number of elements
    Dec  AX             ;adjust the number to zero-based
    Jns  L0             ;at least 1 element, continue
    Jmp  L4             ;0 or less elements, get out now!

L0:
    Mov  BX,Array       ;reload array descriptor address
    Mov  BX,[BX]        ;Array$(LBOUND) descriptor address
    Mov  S,SP           ;StackPtr = 0 (normalized to SP)
    Mov  F,CX           ;F = 0
    Mov  L,AX           ;L = Size%

;----- calculate the value of MidPoint
L1:
    Mov  DI,L           ;MidPoint = (L + F) \ 2
    Add  DI,F
    Shr  DI,1
    Mov  MidPoint,DI

    Mov  AX,F           ;I = F
    Mov  I,AX

    Mov  AX,L           ;J = L
    Mov  J,AX

;----- calculate the offset into the descriptor table for Array$(MidPoint)
L1_2:

    Shl  DI,1           ;multiply MidPoint in DI times 4
    Shl  DI,1           ;now DI holds how far beyond Array$(Start)
                        ;  Array$(MidPoint)'s descriptor is
    Add  DI,BX          ;add the array base address to produce the final
                        ;  address for Array$(MidPoint)

;----- calculate descriptor offset for Array$(I)
L2:
    Mov  SI,I           ;put I into SI
    Shl  SI,1           ;as above
    Shl  SI,1           ;now SI holds how far beyond Array$(Start)
                        ;  Array$(I)'s descriptor is
    Add  SI,BX          ;add the base to produce the final descriptor
                        ;  address

    ;IF Array$(I) < Array$(MidPoint) THEN I = I + 1: GOTO L2
    Push BX             ;save BX because B$SCMP trashes it
    Push SI
    Push DI
    Call B$SCMP         ;do the compare
    Pop  BX             ;restore BX

X1 Label Byte           ;modify the code below to "Jbe" if descending sort
    Jae  L2_1           ;Array$(I) isn't less, continue on

    Inc  Word Ptr I     ;I = I + 1
    Jmp  Short L2       ;GOTO L2

;----- calculate descriptor offset for Array$(J)
L2_1:
    Mov  SI,J           ;put J into SI
    Shl  SI,1           ;as above
    Shl  SI,1           ;now SI holds how far beyond Array$(Start)
                        ;  Array$(J)'s descriptor is
    Add  SI,BX          ;add the base to produce the final descriptor
                        ;  address

    ;IF Array$(J) > Array$(MidPoint) THEN J = J - 1: GOTO L2.1
    Push BX             ;preserve BX
    Push SI
    Push DI
    Call B$SCMP         ;do the compare
    Pop  BX             ;restore BX

X2 Label Byte           ;modify the code below to "Jae" if descending sort
    Jbe  L2_2           ;Array$(J) isn't greater, continue on

    Dec  Word Ptr J     ;J = J - 1
    Jmp  Short L2_1     ;GOTO L2.1

L2_2:
    Mov  AX,I           ;IF I > J GOTO L3
    Cmp  AX,J
    Jg   L3             ;J is greater, go directly to L3
    Je   L2_3           ;they're the same, skip the swap

    ;Swap Array$(I), Array$(J)
    Mov  SI,I           ;put I into SI
    Mov  DI,J           ;put J into DI

    Cmp  SI,MidPoint    ;IF I = MidPoint THEN MidPoint = J
    Jne  No_Mid1        ;not equal, skip ahead
    Mov  MidPoint,DI    ;equal, assign MidPoint = J
    Jmp  Short No_Mid2  ;don't waste time comparing again

No_Mid1:
    Cmp  DI,MidPoint    ;IF J = MidPoint THEN MidPoint = I
    Jne  No_Mid2        ;not equal, skip ahead
    Mov  MidPoint,SI    ;equal, assign MidPoint = I

No_Mid2:
    Mov  SI,I           ;put I into SI
    Shl  SI,1           ;multiply times four for the
    Shl  SI,1           ;  for the descriptors
    Add  SI,BX          ;add address for first descriptor

    Mov  DI,J           ;do the same for J in DI
    Shl  DI,1
    Shl  DI,1
    Add  DI,BX

    Push BX             ;save BX because B$SWSD destroys it
    Call B$SWSD         ;and swap 'em good
    Pop  BX

L2_3:
    Inc  Word Ptr I     ;I = I + 1
    Dec  Word Ptr J     ;J = J - 1

    Mov  AX,I           ;IF I <= J GOTO L2
    Cmp  AX,J
    Jg   L3             ;it's greater, skip to L3
    Mov  DI,MidPoint    ;get MidPoint again
    Jmp  L1_2           ;go back to just before L2

L3:
    Mov  AX,I           ;IF I < L THEN PUSH I: PUSH L
    Cmp  AX,L
    Jnl  L3_1           ;it's not less, so skip Pushes

    Push I              ;Push I
    Push L              ;Push L

L3_1:
    Mov  AX,J           ;L = J
    Mov  L,AX

    Mov  AX,F           ;IF F < L GOTO L1
    Cmp  AX,L
    Jnl  L3_2           ;it's not less, jump ahead to L3_2
    Jmp  L1             ;it's less, go to L1

L3_2:
    Cmp  S,SP           ;IF S = 0 GOTO L4
    Je   L4

    Pop  L              ;Pop L
    Pop  F              ;Pop F
    Jmp  L1             ;GOTO L1

L4:
    Ret                 ;return to BASIC

Sort Endp
End

Besides being faster than the BASIC version, the assembly language Sort routine is half the size. This version also supports sorting either forward or backward, but not just a portion of an array. The general syntax is:

   CALL Sort(Array$(), Direction)

Where Array$() is any variable-length string array, and Direction is 0 for ascending, or any other value for descending. Note that this routine calls upon BASIC's internal services to perform the actual comparing and swapping; therefore, the exact same code can be used with either QuickBASIC or BASIC PDS. Again, I refer you forward to Chapter 12 for an explanation of the assembly language commands used in SORT.ASM.

Sorting on Multiple Keys

In many situations, sorting based on one key is sufficient. For example, if you are sorting a mailing list to take advantage of bulk rates you must sort all of the addresses in order by zip code. When considering complex data such as a TYPE variable, it is easy to sort the array based on one component of each element. The earlier Bubble Sort example showed how MID$ could be used to consider just a portion of each string, even though the entire elements were exchanged. Had that routine been designed to operate on a TYPE array, the comparisons would have examined just one component, but the SWAP statements would exchange entire elements:

   IF Array(X).ZipCode > Array(X + 1).ZipCode THEN
     SWAP Array(X), Array(X + 1)
   END IF

This way, each customer's last name, first name, street address, and so forth remain connected to the zip codes that are being compared and exchanged.

There are several ways to sort on more than one key, and all are of necessity more complex than simply sorting based on a single key. One example of a multi-key sort first puts all of the last names in order. Then within each group of identical last names the first names are sorted, and within each group of identical last and first names further sorting is performed on yet another key--perhaps Balance Due. As you can see, this requires you to sort based on differing types of data, and also to compare ranges of elements for the subgroups that need further sorting.

The biggest complication with this method is designing a calling syntax that lets you specify all of the information. A table array must be established to hold the number of keys, the type of data in each key (string, double precision, and so forth), and how many bytes into the TYPE element each key portion begins. Worse, you can't simply use the name of a TYPE component in the comparisons inside the sort routine--which would you use: Array(X).LastName, Array(X).FirstName, or Array(X).ZipCode? Therefore, a truly general multi-key sort must be called passing the address where the array begins in memory, and a table of offsets beyond that address where each component being considered is located.

To avoid this added complexity I will instead show a different method that has only a few minor restrictions, but is much easier to design and understand. This method requires you to position each TYPE component into the key order you will sort on. You will also need to store all numbers that will be used for a sort key as ASCII digits. To sort first on last name, then first name, and then on balance due, the TYPE might be structured as follows:

TYPE Customer
  LastName   AS STRING * 15
  FirstName  AS STRING * 15
  BalanceDue AS STRING * 9
  Street     AS STRING * 32
  City       AS STRING * 15
  State      AS STRING * 2
  ZipCode    AS STRING * 5
  AnyNumber  AS DOUBLE
END TYPE

In most cases the order in which each TYPE member is placed has no consequence. When you refer to TypeVar.LastName, BASIC doesn't care if LastName is defined before or after FirstName in the TYPE structure. Either way it translates your reference to LastName into an address. Having to store numeric data as strings is a limitation, but this is needed only for those TYPE fields that will be used as a sort key.

The key to sorting on multiple items simultaneously is by treating the contiguous fields as a single long field. Since assignments to the string portion of a TYPE variable are handled internally by BASIC's LSET routine, the data in each element will be aligned such that subsequent fields can be treated as an extension of the primary field. Figure 8-2 below shows five TYPE array elements in succession, as they would be viewed by a string comparison routine. This data is defined as a subset of the name and address TYPE shown above, using just the first three fields. Notice that the balance due fields must be right-aligned (using RSET) for the numeric values to be considered correctly.

Type.LastName  Type.FirstName Type.BalanceDue
===============---------------=========
Munro          Jay              8000.00
Smith          John              122.03
Johnson        Alfred          14537.89
Rasmussen      Peter             100.90
Hudson         Cindy              21.22
^              ^              ^
Field 1        Field 2        Field 3
starts here    starts here    starts here

Figure 8-2: Multiple contiguous fields in a TYPE can be treated as a single long field.

Thus, the sort routine would be told to start at the first field, and consider the strings to be 15 + 15 + 9 = 39 characters long. This way all three fields are compared at one time, and treated as a single entity. Additional fields can of course follow these, and they may be included in the comparison or not at your option.

The combination demonstration and subroutine below sorts such a TYPE array on any number of keys using this method, and it has a few additional features as well. Besides letting you confine the sorting to just a portion of the array, you may also specify how far into each element the first key is located. As long as the key fields are contiguous, they do not have to begin at the start of each TYPE. Therefore, you could sort just on the first name field, or on any other field or group of fields.

'TYPESORT.BAS - performs a multi-key sort on TYPE arrays

'Copyright (c) 1991 Ethan Winer

DEFINT A-Z
DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
  BYVAL Adr2, NumBytes)
DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, BYVAL Adr2, _
  BYVAL Length)
DECLARE SUB TypeSort (Segment, Address, ElSize, Offset, KeySize, NumEls)

CONST NumEls% = 23              'this keeps it all on the screen

TYPE MyType
  LastName  AS STRING * 10
  FirstName AS STRING * 10
  Dollars   AS STRING * 6
  Cents     AS STRING * 2
END TYPE
REDIM Array(1 TO NumEls%) AS MyType

'---- Disable (REM out) all but one of the following blocks to test

Offset = 27                 'start sorting with Cents
ElSize = LEN(Array(1))      'the length of each element
KeySize = 2                 'sort on the Cents only

Offset = 21                 'start sorting with Dollars
ElSize = LEN(Array(1))      'the length of each element
KeySize = 8                 'sort Dollars and Cents only

Offset = 11                 'start sorting with FirstName
ElSize = LEN(Array(1))      'the length of each element
KeySize = 18                'sort FirstName through Cents

Offset = 1                  'start sorting with LastName
ElSize = LEN(Array(1))      'the length of each element
KeySize = ElSize            'sort based on all 4 fields

FOR X = 1 TO NumEls%        'build the array from DATA
  READ Array(X).LastName
  READ Array(X).FirstName
  READ Amount$              'format the amount into money
  Dot = INSTR(Amount$, ".")
  IF Dot THEN
    RSET Array(X).Dollars = LEFT$(Amount$, Dot - 1)
    Array(X).Cents = LEFT$(MID$(Amount$, Dot + 1) + "00", 2)
  ELSE
    RSET Array(X).Dollars = Amount$
    Array(X).Cents = "00"
  END IF
NEXT

Segment = VARSEG(Array(1))      'show where the array is
Address = VARPTR(Array(1))      '  located in memory
CALL TypeSort(Segment, Address, ElSize, Offset, KeySize, NumEls%)

CLS                             'display the results
FOR X = 1 TO NumEls%
  PRINT Array(X).LastName, Array(X).FirstName,
  PRINT Array(X).Dollars; "."; Array(X).Cents
NEXT

DATA Smith, John, 123.45
DATA Cramer, Phil, 11.51
DATA Hogan, Edward, 296.08
DATA Cramer, Phil, 112.01
DATA Malin, Donald, 13.45
DATA Cramer, Phil, 111.3
DATA Smith, Ralph, 123.22
DATA Smith, John, 112.01
DATA Hogan, Edward, 8999.04
DATA Hogan, Edward, 8999.05
DATA Smith, Bob, 123.45
DATA Cramer, Phil, 11.50
DATA Hogan, Edward, 296.88
DATA Malin, Donald, 13.01
DATA Cramer, Phil, 111.1
DATA Smith, Ralph, 123.07
DATA Smith, John, 112.01
DATA Hogan, Edward, 8999.33
DATA Hogan, Edward, 8999.17
DATA Hogan, Edward, 8999.24
DATA Smith, John, 123.05
DATA Cramer, David, 1908.80
DATA Cramer, Phil, 112
END

SUB TypeSort (Segment, Address, ElSize, Displace, KeySize, NumEls) STATIC

REDIM QStack(NumEls \ 5 + 10) 'create a stack array

First = 1                  'initialize working variables
Last = NumEls
Offset = Displace - 1      'decrement once now rather than
                           '  repeatedly later
DO
  DO
    Temp = (Last + First) \ 2   'seek midpoint
    I = First
    J = Last

    DO
      WHILE Compare3%(Segment, Address + Offset + (I - 1) * ElSize, Segment, _
        Address + Offset + (Temp-1) * ElSize, KeySize) = -1 '< 1 for descending
        I = I + 1
      WEND
      WHILE Compare3%(Segment, Address + Offset + (J - 1) * ElSize, Segment, _
        Address + Offset + (Temp-1)  * ElSize, KeySize) = 1 '< -1 for descending
        J = J - 1
      WEND
      IF I > J THEN EXIT DO
      IF I < J THEN
        CALL SwapMem(Segment, Address + (I - 1) * ElSize, Segment, _
          Address + (J - 1) * ElSize, ElSize)
        IF Temp = I THEN
          Temp = J
        ELSEIF Temp = J THEN
          Temp = I
        END IF
      END IF
      I = I + 1
      J = J - 1
    LOOP WHILE I <= J

    IF I < Last THEN
      QStack(StackPtr) = I              'Push I
      QStack(StackPtr + 1) = Last       'Push Last
      StackPtr = StackPtr + 2
    END IF

    Last = J
  LOOP WHILE First < Last

  IF StackPtr = 0 THEN EXIT DO          'Done
  StackPtr = StackPtr - 2
  First = QStack(StackPtr)              'Pop First
  Last = QStack(StackPtr + 1)           'Pop Last
LOOP

ERASE QStack                    'delete the stack array

END SUB

As you can see, this version of the Quick Sort subprogram is derived from the one shown earlier. The important difference is that all of the incoming information is passed as segments, addresses, and bytes, rather than using an explicit array name. But before describing the inner details of the subprogram itself, I'll address the demonstration portion and show how the routine is set up and called.

As with some of the other procedures on the disk that comes with this book, you will extract the TypeSort subprogram and add it to your own programs by loading it as a module, and then using the Move option of BASIC's View Subs menu. You can quickly access this menu by pressing F2, and then use Alt-M to select Move. Once this is done you will unload TYPESORT.BAS using the Alt-F-U menu selection, and answer *No* when asked if you want to save the modified file. You could also copy the TypeSort subprogram into a separate file, and then load that file as a module in each program that needs it.

Although the example TYPE definition here shows only four components, you may of course use any TYPE structure. TypeSort expects six parameters to tell it where in memory the array is located, how far into each element the comparison routines are to begin, the total length of each element, the length of the key fields, and the number of elements to sort.

After defining MyType, the setup portion of TYPESORT.BAS establishes the offset, element size, and key size parameters. As you can see, four different sample setups are provided, and you should add remarking apostrophes to all but one of them. If the program is left as is, the last setup values will take precedence.

The next section reads sample names, addresses and dollar amounts from DATA statements, and formats the dollar amounts as described earlier. The dollar portion of the amounts are right justified into the Dollars field of each element, and the Cents portion is padded with trailing zeros as necessary to provide a dollars and cents format. This way, the value 12.3 will be assigned as 12.30, and 123 will be formatted to 123.00 which gives the expected appearance.

The final setup step is to determine where the array begins in memory. Since you specify the starting segment and address, it is simple to begin sorting at any array element. For example, to sort elements 100 through 200--even if the array is larger than that--you'd use VARSEG(Array(100)) and VARPTR(Array(100) instead of element 1 as shown in this example.

In addition to the starting segment and address of the array, TypeSort also requires you to tell it how many elements to consider. If you are sorting the entire array and the array starts with element 1, this will simply be the UBOUND of the array. If you are sorting just a portion of the array then you give it only the number of elements to be sorted. So to sort elements 100 through 200, the number of elements will be 101. A general formula you can use for calculating this based on element numbers is NumElements = LastElement - FirstElement + 1.

Now let's consider the TypeSort subprogram itself. Since it is more like the earlier QSort program than different, I will cover only the differences here. In fact, the primary difference is in the way comparisons and exchanges are handled. The Compare3 function introduced earlier is used to compare the array elements with the midpoint. Although QSort made a temporary copy of the midpoint element, that would be difficult to do here. Since the routine is designed to work with any type of data--and the size of each element can vary depending on the TYPE structure--it is impractical to make a copy.

While SPACE$ could be used to claim a block of memory into which the midpoint element is copied, there's a much better way: the Temp variable is used to remember the element number itself. The only complication is that once elements I and J are swapped, Temp must be reassigned if it was equal to either of them. (This happens just below the call to SwapMem.) But the simple integer IF test and assignment required adds far less code and is much faster than making a copy of the element.

TypeSort is designed to sort the array in ascending order, and comments in the code show how to change it to sort descending instead. If you prefer to have one subprogram that can do both, you should add an extra parameter, perhaps called Direction. Near the beginning of the routine before the initial outer DO you would add this:

   IF Direction = 0 THEN     'sort ascending
     ICompare = -1
     JCompare = 1
   ELSE                      'sort descending
     ICompare = 1
     JCompare = -1
   END IF

Then, where the results from Compare3 are compared to -1 and 1 replace those comparisons (at the end of each WHILE line) to instead use ICompare and JCompare:

   WHILE Compare3%(...) = ICompare
     I = I + 1
   WEND
   WHILE Compare3%(...) = JCompare
     J = J - 1
   WEND

This way, you are using variables to establish the sorting direction, and those variables can be set either way each time TypeSort is called.

The last major difference is that elements are exchanged using the SwapMem routine rather than BASIC's SWAP statement. While it is possible to call SWAP by aliasing its name as shown in Chapter 5, it was frankly simpler to write a new routine for this purpose. Further, BASIC's SWAP is slower than SwapMem because it must be able to handle variables of different lengths, and also exchange fixed-length and conventional strings. SwapMem is extremely simple, and it works very quickly.

As I stated earlier, the only way to write a truly generic sort routine is by passing segments and addresses and bytes, instead of array names. Although it would be great if BASIC could let you declare a subprogram or function using the AS ANY option to allow any type of data, that simply wouldn't work. As BASIC compiles your program, it needs to know the size and type of each parameter. When you reference TypeVar.LastName, BASIC knows where within TypeVar the LastName portion begins, and uses that in its address calculations. It is not possible to avoid this limitation other than by using addresses as is done here.

Indeed, this is the stuff that C and assembly language programs are made of. In these languages--especially assembly language--integer pointer variables are used extensively to show where data is located and how long it is. However, the formulas used within the Compare3 and SwapMem function calls are not at all difficult to understand.

The formula Address + Offset - (I - 1) * ElSize indicates where the key field of element I begins. Address holds the address of the beginning of the first element, and Offset is added to identify the start of the first sort key. (I - 1) is used instead of I because addresses are always zero- based. That is, the first element in the array from TypeSort's perspective is element 0, even though the calling program considers it to be element 1. Finally, the element number is multiplied times the length of each element, to determine the value that must be added to the starting address and offset to obtain the final address for the data in element I. Please understand that calculations such as these are what the compiler must do each time you access an array element.

Note that if you call TypeSort incorrectly or give it illegal element numbers, you will not receive a "Subscript out of range" error from BASIC. Rather, you will surely crash your PC and have to reboot. This is the danger--and fun--of manipulating pointers directly.

As I stated earlier, the SwapMem routine that does the actual exchanging of elements is very simple, and it merely takes a byte from one element and exchanges it with the corresponding byte in the other. This task is greatly simplified by the use of the XCHG assembly language command, which is similar to BASIC's SWAP statement. Although XCHG cannot swap a word in memory with another word in memory, it can exchange memory with a register. SwapMem is shown in the listing below.

;SWAPMEM.ASM, swaps two sections of memory

.Model Medium, Basic
.Code

SwapMem Proc Uses SI DI DS ES, Var1:DWord, Var2:DWord, NumBytes:Word

    Lds  SI,Var1      ;get the segmented address of the
                      ;  first variable
    Les  DI,Var2      ;and for the second variable
    Mov  CX,NumBytes  ;get the number of bytes to exchange
    Jcxz Exit         ;we can't swap zero bytes!

DoSwap:
    Mov  AL,ES:[DI]   ;get a byte from the second variable
    Xchg AL,[SI]      ;swap it with the first variable
    Stosb             ;complete the swap and increment DI
    Inc  SI           ;point to the next source byte
    Loop DoSwap       ;continue until done

Exit:
    Ret               ;return to BASIC

SwapMem Endp
End

Indexed Sorting on Multiple Keys

Earlier I showed how to modify the simple Bubble Sort routine to sort a parallel index array instead of the primary array. One important reason you might want to do that is to allow access to the primary array in both its original and sorted order. Another reason, and one we will get to shortly, is to facilitate sorting disk files. Although a routine to sort the records in a file could swap the actual data, it takes a long time to read and write that much data on disk. Further, each time you wanted to access the data sorted on a different key, the entire file would need to be sorted again.

A much better solution is to create one or more sorted lists of record numbers, and store those on disk each in a separate file. This lets you access the data sorted by name, or by zip code, or by any other field, without ever changing the actual file. The TypeISort subprogram below is adapted from TypeSort, and it sorts an index array that holds the element numbers of a TYPE array.

'TYPISORT.BAS, indexed multi-key sort for TYPE arrays

DEFINT A-Z

DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
  BYVAL Adr2, NumBytes)
DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
  BYVAL Adr2, BYVAL Length)
DECLARE SUB TypeISort (Segment, Address, ElSize, Offset, KeySize, _
  NumEls, Index())

CONST NumEls% = 23              'this keeps it all on the screen

TYPE MyType
  LastName  AS STRING * 10
  FirstName AS STRING * 10
  Dollars   AS STRING * 6
  Cents     AS STRING * 2
END TYPE
REDIM Array(1 TO NumEls%) AS MyType
REDIM Index(1 TO NumEls%)   'create the index array

Offset = 1                  'start sorting with LastName
ElSize = LEN(Array(1))      'the length of each element
KeySize = ElSize            'sort based on all 4 fields

FOR X = 1 TO NumEls%        'build the array from DATA
  READ Array(X).LastName
  READ Array(X).FirstName
  READ Amount$            
   ...                      'this continues as already
   ...                      '  shown in TypeSort
NEXT

FOR X = 1 TO NumEls%            'initialize the index
  Index(X) = X - 1              'but starting with 0
NEXT

Segment = VARSEG(Array(1))      'show where the array is
Address = VARPTR(Array(1))      '  located in memory
CALL TypeISort(Segment, Address, ElSize, Offset, KeySize, NumEls%, Index())

CLS                             'display the results
FOR X = 1 TO NumEls%            '+ 1 adjusts to one-based
  PRINT Array(Index(X) + 1).LastName,
  PRINT Array(Index(X) + 1).FirstName,
  PRINT Array(Index(X) + 1).Dollars; ".";
  PRINT Array(Index(X) + 1).Cents
NEXT

DATA Smith, John, 123.45        'this continues as already
  ...                           '  shown in TypeSort
  ...

END

SUB TypeISort (Segment, Address, ElSize, Displace, KeySize, NumEls, _
  Index()) STATIC

REDIM QStack(NumEls \ 5 + 10) 'create a stack

First = 1                     'initialize working variables
Last = NumEls
Offset = Displace - 1         'make zero-based now for speed later

DO
  DO
    Temp = (Last + First) \ 2 'seek midpoint
    I = First
    J = Last

    DO  'change -1 to 1 and 1 to -1 to sort descending
      WHILE Compare3%(Segment, Address + Offset + (Index(I) * ElSize), _
        Segment, Address + Offset + (Index(Temp) * ElSize), KeySize) = -1
        I = I + 1
      WEND
      WHILE Compare3%(Segment, Address + Offset + (Index(J) * ElSize), _
        Segment, Address + Offset + (Index(Temp) * ElSize), KeySize) = 1
        J = J - 1
      WEND
      IF I > J THEN EXIT DO
      IF I < J THEN
        SWAP Index(I), Index(J)
        IF Temp = I THEN
          Temp = J
        ELSEIF Temp = J THEN
          Temp = I
        END IF
      END IF
      I = I + 1
      J = J - 1
    LOOP WHILE I <= J

    IF I < Last THEN
      QStack(StackPtr) = I              'Push I
      QStack(StackPtr + 1) = Last       'Push Last
      StackPtr = StackPtr + 2
    END IF

    Last = J
  LOOP WHILE First < Last

  IF StackPtr = 0 THEN EXIT DO          'Done
  StackPtr = StackPtr - 2
  First = QStack(StackPtr)              'Pop First
  Last = QStack(StackPtr + 1)           'Pop Last
LOOP

ERASE QStack                    'delete the stack array

END SUB

As with TypeSort, TypeISort is entirely pointer based so it can be used with any type of data and it can sort multiple contiguous keys. The only real difference is the addition of the Index() array parameter, and the extra level of indirection needed to access the index array each time a comparison is made. Also, when a swap is required, only the integer index elements are exchanged, which simplifies the code and reduces its size. Like TypeSort, you can change the sort direction by reversing the -1 and 1 values used with Compare3, or add a Direction parameter to the list and modify the code to use that.

As with BubbleISort, the index array is initialized to increasing values by the calling program; however, here the first element is set to hold a value of 0 instead of 1. This reduces the calculations needed within the routine each time an address must be obtained. Therefore, when TypeISort returns, the caller must add 1 to the element number held in each index element. This is shown within the FOR/NEXT loop that displays the sorted results.

Sorting Files

With the development of TypeISort complete, we can now use that routine to sort disk files. The sorting strategy will be to determine how many records are in the file, to determine how many separate passes are needed to process the entire file. TypeISort and TypeSort are restricted to working with arrays no larger than 64K (32K in the editing environment), so there is a limit as to how much data may be loaded into memory at one time. These sort routines can accommodate more data when compiled because address calculations that result in values larger than 32767 cause an overflow error in the QB editor. This overflow is in fact harmless, and is ignored in a compiled program unless you use the /d switch.

Although the routines could be modified to perform segment and address arithmetic to accommodate larger arrays, that still wouldn't solve the problem of having more records than can fit in memory at once. Therefore, separate passes must be used to sort the file contents in sections, with each pass writing a temporary index file to disk. A final merge pass then reads each index to determine which pieces fits where, and then writes the final index file. The program FILESORT.BAS below incorporates all of the sorting techniques shown so far, and includes a few custom BASIC routines to improve its performance.

'FILESORT.BAS, indexed multi-key random access file sort

DEFINT A-Z

DECLARE FUNCTION Compare3% (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, _
  BYVAL Adr2, NumBytes)
DECLARE FUNCTION Exist% (FileSpec$)
DECLARE SUB DOSInt (Registers AS ANY)
DECLARE SUB FileSort (FileName$, NDXName$, RecLength, Offset, KeySize)
DECLARE SUB LoadFile (FileNum, Segment, Address, Bytes&)
DECLARE SUB SaveFile (FileNum, Segment, Address, Bytes&)
DECLARE SUB SwapMem (BYVAL Seg1, BYVAL Adr1, BYVAL Seg2, BYVAL Adr2, _
  BYVAL Length)
DECLARE SUB TypeISort (Segment, Address, ElSize, Offset, KeySize, _
  NumEls, Index())

RANDOMIZE TIMER                 'create new data each run
DEF FnRand% = INT(RND * 10 + 1) 'returns RND from 1 to 10

TYPE RegType                    'used by DOSInt
  AX AS INTEGER
  BX AS INTEGER
  CX AS INTEGER
  DX AS INTEGER
  BP AS INTEGER
  SI AS INTEGER
  DI AS INTEGER
  FL AS INTEGER
  DS AS INTEGER
  ES AS INTEGER
END TYPE

DIM SHARED Registers AS RegType 'share among all subs
REDIM LastNames$(1 TO 10)       'we'll select names at
REDIM FirstNames$(1 TO 10)      '  random from these

NumRecords = 2988               'how many test records to use
FileName$ = "TEST.DAT"          'really original, eh?
NDXName$ = "TEST.NDX"           'this is the index file name

TYPE RecType
  LastName  AS STRING * 11
  FirstName AS STRING * 10
  Dollars   AS STRING * 6
  Cents     AS STRING * 2
  AnyNumber AS LONG         'this shows that only key
  OtherNum  AS LONG         '  information must be ASCII
END TYPE

FOR X = 1 TO 10             'read the possible last names
  READ LastNames$(X)
NEXT

FOR X = 1 TO 10             'and the possible first names
  READ FirstNames$(X)
NEXT

DIM RecordVar AS RecType    'to create the sample file
RecLength = LEN(RecordVar)  'the length of each record
CLS
PRINT "Creating a test file..."

IF Exist%(FileName$) THEN   'if there's an existing file
  KILL FileName$            'kill the old data from prior
END IF                      '  runs to start fresh

IF Exist%(NDXName$) THEN    'same for any old index file
  KILL NDXName$
END IF


'---- Create some test data and write it to the file
OPEN FileName$ FOR RANDOM AS #1 LEN = RecLength
  FOR X = 1 TO NumRecords
    RecordVar.LastName = LastNames$(FnRand%)
    RecordVar.FirstName = FirstNames$(FnRand%)
    Amount$ = STR$(RND * 10000)
    Dot = INSTR(Amount$, ".")
    IF Dot THEN
      RSET RecordVar.Dollars = LEFT$(Amount$, Dot - 1)
      RecordVar.Cents = LEFT$(MID$(Amount$, Dot + 1) + "00", 2)
    ELSE
      RSET RecordVar.Dollars = Amount$
      RecordVar.Cents = "00"
    END IF
    RecordVar.AnyNumber = X
    PUT #1, , RecordVar
  NEXT
CLOSE

'----- Created a sorted index based on the main data file
Offset = 1                  'start sorting with LastName
KeySize = 29                'sort based on first 4 fields
PRINT "Sorting..."
CALL FileSort(FileName$, NDXName$, RecLength, Offset, KeySize)


'----- Display the results
CLS
VIEW PRINT 1 TO 24
LOCATE 25, 1
COLOR 15
PRINT "Press any key to pause/resume";
COLOR 7
LOCATE 1, 1

OPEN FileName$ FOR RANDOM AS #1 LEN = RecLength
OPEN NDXName$ FOR BINARY AS #2
  FOR X = 1 TO NumRecords
    GET #2, , ThisRecord            'get next rec. number
    GET #1, ThisRecord, RecordVar   'then the actual data

    PRINT RecordVar.LastName;       'print each field
    PRINT RecordVar.FirstName;
    PRINT RecordVar.Dollars; ".";
    PRINT RecordVar.Cents

    IF LEN(INKEY$) THEN             'pause on a keypress
      WHILE LEN(INKEY$) = 0: WEND
    END IF
  NEXT
CLOSE

VIEW PRINT 1 TO 24                  'restore the screen
END

DATA Smith, Cramer, Malin, Munro, Passarelli
DATA Bly, Osborn, Pagliaro, Garcia, Winer

DATA John, Phil, Paul, Anne, Jacki
DATA Patricia, Ethan, Donald, Tami, Elli
END


FUNCTION Exist% (Spec$) STATIC  'reports if a file exists

DIM DTA AS STRING * 44          'the work area for DOS
DIM LocalSpec AS STRING * 60    'guarantee the spec is in
LocalSpec$ = Spec$ + CHR$(0)    '  DGROUP for BASIC PDS

Exist% = -1                     'assume true for now

Registers.AX = &H1A00           'assign DTA service
Registers.DX = VARPTR(DTA)      'show DOS where to place it
Registers.DS = VARSEG(DTA)
CALL DOSInt(Registers)

Registers.AX = &H4E00           'find first matching file
Registers.CX = 39               'any file attribute okay
Registers.DX = VARPTR(LocalSpec)
Registers.DS = VARSEG(LocalSpec)
CALL DOSInt(Registers)          'see if there's a match

IF Registers.FL AND 1 THEN      'if the Carry flag is set
  Exist% = 0                    '  there were no matches
END IF

END FUNCTION


SUB FileSort (FileName$, NDXName$, RecLength, Displace, KeySize) STATIC

CONST BufSize% = 32767  'holds the data being sorted
Offset = Displace - 1   'make zero-based for speed later

'----- Open the main data file
FileNum = FREEFILE
OPEN FileName$ FOR BINARY AS #FileNum

'----- Calculate the important values we'll need
NumRecords = LOF(FileNum) \ RecLength
RecsPerPass = BufSize% \ RecLength
IF RecsPerPass > NumRecords THEN RecsPerPass = NumRecords

NumPasses = (NumRecords \ RecsPerPass) - ((NumRecords MOD RecsPerPass) _
  <> 0)
IF NumPasses = 1 THEN
  RecsLastPass = RecsPerPass
ELSE
  RecsLastPass = NumRecords MOD RecsPerPass
END IF

'----- Create the buffer and index sorting arrays
REDIM Buffer(1 TO 1) AS STRING * BufSize
REDIM Index(1 TO RecsPerPass)
IndexAdjust = 1


'----- Process all of the records in manageable groups
FOR X = 1 TO NumPasses

  IF X < NumPasses THEN         'if not the last pass
    RecsThisPass = RecsPerPass  'do the full complement
  ELSE                          'the last pass may have
    RecsThisPass = RecsLastPass '  fewer records to do
  END IF

  FOR Y = 1 TO RecsThisPass     'initialize the index
    Index(Y) = Y - 1            'starting with value of 0
  NEXT

  '----- Load a portion of the main data file
  Segment = VARSEG(Buffer(1))   'show where the buffer is
  CALL LoadFile(FileNum, Segment, Zero, RecsThisPass * CLNG(RecLength))
  CALL TypeISort(Segment, Zero, RecLength, Displace, KeySize, _
    RecsThisPass, Index())

  '----- Adjust the zero-based index to record numbers
  FOR Y = 1 TO RecsThisPass
    Index(Y) = Index(Y) + IndexAdjust
  NEXT

  '----- Save the index file for this pass
  TempNum = FREEFILE
  OPEN "$$PASS." + LTRIM$(STR$(X)) FOR OUTPUT AS #TempNum
  CALL SaveFile(TempNum, VARSEG(Index(1)), Zero, RecsThisPass * 2&)
  CLOSE #TempNum

  '----- The next group of record numbers start this much higher
  IndexAdjust = IndexAdjust + RecsThisPass

NEXT

ERASE Buffer, Index             'free up the memory


'----- Do a final merge pass if necessary
IF NumPasses > 1 THEN

  NDXNumber = FREEFILE
  OPEN NDXName$ FOR BINARY AS #NDXNumber
  REDIM FileNums(NumPasses)        'this holds the file numbers
  REDIM RecordNums(NumPasses)      'this holds record numbers

  REDIM MainRec$(1 TO NumPasses)   'holds main record data
  REDIM Remaining(1 TO NumPasses)  'tracks index files

  '----- Open the files and seed the first round of data
  FOR X = 1 TO NumPasses
    FileNums(X) = FREEFILE
    OPEN "$$PASS." + LTRIM$(STR$(X)) FOR BINARY AS #FileNums(X)
    Remaining(X) = LOF(FileNums(X))   'this is what remains
    MainRec$(X) = SPACE$(RecLength)   'holds main data file

    GET #FileNums(X), , RecordNums(X)     'get the next record number
    RecOffset& = (RecordNums(X) - 1) * CLNG(RecLength) + 1
    GET #FileNum, RecOffset&, MainRec$(X) 'then get the data
  NEXT

  FOR X = 1 TO NumRecords

    Lowest = 1               'assume this is the lowest data in the group
    WHILE Remaining(Lowest) = 0 'Lowest can't refer to a dead index
      Lowest = Lowest + 1       'so seek to the next higher active index
    WEND

    FOR Y = 2 TO NumPasses      'now seek out the truly lowest element
      IF Remaining(Y) THEN      'consider only active indexes
        IF Compare3%(SSEG(MainRec$(Y)), _    '<-- use VARSEG with QB
          SADD(MainRec$(Y)) + Offset,   _
          SSEG(MainRec$(Lowest)),       _    '<-- use VARSEG with QB
          SADD(MainRec$(Lowest)) + Offset, KeySize) = -1 THEN
           Lowest = Y
        END IF
      END IF
    NEXT

    PUT #NDXNumber, , RecordNums(Lowest)     'write the main index
    Remaining(Lowest) = Remaining(Lowest) - 2
    IF Remaining(Lowest) THEN                'if the index is still active
      GET #FileNums(Lowest), , RecordNums(Lowest)
      RecOffset& = (RecordNums(Lowest) - 1) * CLNG(RecLength) + 1
      GET #FileNum, RecOffset&, MainRec$(Lowest)
    END IF

  NEXT

ELSE
  '----- Only one pass was needed so simply rename the index file
  NAME "$$PASS.1" AS NDXName$
END IF

CLOSE                       'close all open files

IF Exist%("$$PASS.*") THEN  'ensure there's a file to kill
  KILL "$$PASS.*"           'kill the work files
END IF

ERASE FileNums, RecordNums  'erase the work arrays
ERASE MainRec$, Remaining

END SUB


SUB LoadFile (FileNum, Segment, Address, Bytes&) STATIC
  IF Bytes& > 32767 THEN Bytes& = Bytes& - 65536
  Registers.AX = &H3F00         'read from file service
  Registers.BX = FILEATTR(FileNum, 2) 'get the DOS handle
  Registers.CX = Bytes&         'how many bytes to load
  Registers.DX = Address        'and at what address
  Registers.DS = Segment        'and at what segment
  CALL DOSInt(Registers)
END SUB


SUB SaveFile (FileNum, Segment, Address, Bytes&) STATIC
  IF Bytes& > 32767 THEN Bytes& = Bytes& - 65536
  Registers.AX = &H4000         'write to file service
  Registers.BX = FILEATTR(FileNum, 2) 'get the DOS handle
  Registers.CX = Bytes&         'how many bytes to load
  Registers.DX = Address        'and at what address
  Registers.DS = Segment        'and at what segment
  CALL DOSInt(Registers)
END SUB


SUB TypeISort (....) STATIC     'as shown in TYPISORT.BAS

END SUB

FILESORT.BAS begins by defining a function that returns a random number between 1 and 10. Although the earlier sort demonstrations simply read the test data from DATA statements, that is impractical when creating thousands of records. Instead, two arrays are filled--one with ten last names and another with ten first names--and these names are drawn from at random. The Registers TYPE variable that is defined is used by three of the supporting routines in this program. RegType is normally associated with CALL Interrupt and InterruptX, but I have written a small-code replacement to mimic InterruptX that works with DOS Interrupt &H21 only. DOSInt accepts just a single Registers argument, instead of the three parameters that BASIC's Interrupt and InterruptX require. Besides adding less code each time it is used, the routine itself is smaller and simpler than InterruptX.

The remainder of the demonstration program should be easy to follow, so I won't belabor its operation; the real action is in the FileSort subprogram.

Like TypeSort and TypeISort, FileSort is entirely pointer based, to accommodate TYPE elements of any size and structure. You provide the name of the main data file to be sorted, the name of an index file to create, and the length and offset of the keys within the disk records. The Displace parameter tells how far into the TYPE structure the key information is located. When calling TypeISort this value is should be one-based, but in the final merge pass where Compare3 is used, a zero-based number is required. Therefore, a copy is made (Offset = Displace - 1) near the beginning of the routine. This way, both are available quickly without having to calculate - 1 repeatedly slowing its operation.

The initial steps FileSort performs are to determine how many records are in the data file, and from that how many records can fit into memory at one time. Once these are known, the number of passes necessary can be easily calculated. An extra step is needed to ensure that RecsPerPass is not greater than the number of records in the file. Just because 200 records can fit into memory at once doesn't mean there are that many records. In most cases where multiple passes are needed the last pass will process fewer records than the others. If there are, say, 700 records and each pass can sort 300, the last pass will sort only 100 records.

Once the pass information is determined, a block of memory is created to hold each portion of the file for sorting. This is the purpose of the Buffer array. REDIM is used to create a 32K chunk of memory that doesn't impinge on available string space.

For each pass that is needed, the number of records in the current pass is determined and the index array is initialized to increasing values. Then, a portion of the main data file is read using the LoadFile subprogram. BASIC does not allow you to read records from a random access file directly into a buffer specified by its address. And even if it did, we can load data much faster than pure BASIC by reading a number of records all at once.

Once the current block of records has been loaded, TypeISort is called to sort the index array. The index array is also saved very quickly using SaveFile, which is the compliment to LoadFile. A unique name is given to each temporary index file such that the first one is named $$PASS.1, the second $$PASS.2, and so forth. By using dollar signs in the name it is unlikely that the routine will overwrite an existing file from another application. Of course, you may change the names to anything else if you prefer.

Notice the extra step that manipulates the IndexAdjust variable. This adjustment is needed because each sort pass returns the index array holding record numbers starting at 0. The first time through, 1 must be added to each element to reflect BASIC's use of record numbers that start at 1. If the first pass sorts, say, 250 records, then the index values 1 through 250 are saved to disk. But the second pass is processing records 251 through 500, so an adjustment value of 251 must be added to each element prior to writing it to disk.

If the data file is small and only one pass was needed, the $$PASS.1 file is simply renamed to whatever the caller specified. Otherwise, a merge pass is needed to determine which record number is the next in sequence based on the results of each pass. Believe it or not, this is the trickiest portion of the entire program. For the sake of discussion, we'll assume that four passes were required to sort the file.

Each of the four index files contains a sequence of record numbers, and all of the records within that sequence are in sorted order. However, there is no relationship between the data records identified in one index file and those in another. Thus, each index file and corresponding data record must be read in turn. A FOR/NEXT loop then compares each of the four records, to see which is truly next in the final sequence. The complication arises as the merge nears completion, because some of the indexes will have become exhausted. This possibility is handled by the Remaining array.

Elements in the Remaining array are initialized to the length of each index file as the indexes are opened. Then, as each index entry is read from disk, the corresponding element is decremented by two to show that another record number was read. Therefore, the current Remaining element must be checked to see if that index has been exhausted. Otherwise, data that was already processed might be considered in the merge comparisons.

The final steps are to close all the open files, delete the temporary index files, and erase the work arrays to free the memory they occupied.

One important point to observe is the use of SSEG to show Compare3 where the MainRec$ elements are located. SSEG is for BASIC 7 only; if you are using QuickBASIC you must change SSEG to VARSEG. SSEG can be used with either near or far strings in BASIC 7, but VARSEG works with near strings only. SSEG is used as the default, so an error will be reported if you are using QuickBASIC. The cursor will then be placed near the comment in the program that shows the appropriate correction.

Searching Fundamentals

As with sorting, searching data effectively also requires that you select an appropriate algorithm. There are many ways to search data, and we will look at several methods here. The easiest to understand is a linear search, which simply examines each item in sequence until a match is found:

FoundAt = 0                   'assume no match

FOR X = 1 TO NumElements      'search all elements
  IF Array$(X) = Sought$ THEN
    FoundAt = X               'remember where it is
    EXIT FOR                  'no need to continue
  END IF
NEXT

IF FoundAt THEN               'if it was found
  PRINT "Found at element"; FoundAt
ELSE
  PRINT "Not found"           'otherwise
END IF

For small arrays a linear search is effective and usually fast enough. Also, integer and long integer arrays can be searched reasonably quickly even if there are many elements. But with string data, as the number of elements that must be searched increases, the search time can quickly become unacceptable. This is particularly true when additional features are required such as searching without regard to capitalization or comparing only a portion of each element using MID$. Indeed, many of the same techniques that enhance a sort routine can also be employed when searching.

To search ignoring capitalization you would first capitalize Sought$ outside of the loop, and then use UCASE$ with each element in the comparisons. Using UCASE$(Sought$) repeatedly within the loop is both wasteful and unnecessary:

   Sought$ = UCASE$(Sought$)
    .
    .
   IF UCASE$(Array$(X)) = Sought$ THEN

Likewise, comparing only a portion of each string will require MID$ with each comparison, after using MID$ initially to extract what is needed from Sought$:

   Sought$ = MID$(Sought$, 12, 6)
    .
    .
   IF MID$(Array$(X), 12, 6) = Sought$ THEN

And again, as with sorting, these changes may be combined in a variety of ways. You could even use INSTR to see if the string being searched for is within the array, when an exact match is not needed:

   IF INSTR(UCASE$(Array$(X)), Sought$) THEN

However, each additional BASIC function you use will make the searching slower and slower. Although BASIC's INSTR is very fast, adding UCASE$ to each comparison as shown above slows the overall process considerably.

There are three primary ways that searching can be speeded up. One is to apply simple improvements based on understanding how BASIC works, and knowing which commands are fastest. The other is to select a better algorithm. The third is to translate selected portions of the search routine into assembly language. I will use all three of these techniques here, starting with enhancements to the linear search, and culminating with a very fast binary search for use with sorted data.

One of the slowest operations that BASIC performs is comparing strings. For each string, its descriptor address must be loaded and passed to the comparison routine. That routine must then obtain the actual data address, and examine each byte in both strings until one of the characters is different, or it determines that both strings are the same. As I mentioned earlier, if one or both of the strings are fixed-length, then copies also must be made before the comparison can be performed.

There is another service that the string comparison routine must perform, which is probably not obvious to most programmers and which also impacts its speed. BASIC frequently creates and then deletes temporary strings without your knowing it. One example is the copy it makes of fixed-length strings before comparing them. But there are other, more subtle situations in which this can happen.

For example, when you use IF X$ + Y$ > Z$ BASIC must create a temporary string comprised of X$ + Y$, and then pass that to the comparison routine. Therefore, that routine is also responsible for determining if the incoming string is a temporary copy, and deleting it if so. In fact, all of BASIC's internal routines that accept string arguments are required to do this.

Therefore, one good way to speed searching of conventional (not fixed- length) string arrays is to first compare the lengths. Since strings whose lengths are different can't possibly be the same, this will quickly weed those out. BASIC's LEN function is much faster than its string compare routine, and it offers a simple but effective opportunity to speed things up. LEN is made even faster because it requires only a single argument, as opposed to the two required for the comparison routine.

SLen = LEN(Sought$)       'do this once outside the loop
FOR X = 1 TO NumElements
  IF LEN(Array$(X)) = SLen THEN   'maybe...
    IF Array$(X) = Sought$ THEN   'found it!
      FoundAt = X
      EXIT FOR
    END IF
  END IF
NEXT

Similarly, if the first characters are not the same then the strings can't match either. Like LEN, BASIC's ASC is much faster than the full string comparison routine, and it too can improve search time by eliminating elements that can't possibly match. Depending on the type and distribution of the data in the array, using both LEN and ASCII can result in a very fast linear search:

SLen = LEN(Sought$)
SAsc = ASC(Sought$)
FOR X = 1 TO NumElements
  IF LEN(Array$(X)) = SLen THEN
    IF ASC(Array$(X)) = SAsc THEN
      IF Array$(X) = Sought$ THEN
        ...
      END IF
    END IF
  END IF
NEXT

Notice that the LEN test must always be before the ASC test, to avoid an "Illegal function call" error if the array element is a null string. If all or most of the strings are the same length, then LEN will not be helpful, and ASC should be used alone.

As I mentioned before, when comparing fixed-length string arrays BASIC makes a copy of each element into a conventional string, prior to calling its comparison routine. This copying is also performed when using ASC is used, but not LEN. After all, the length of a fixed-length never changes, and BASIC is smart enough to know the length directly. But then, comparing the lengths of these string is pointless anyway.

Because of the added overhead to make these copies, the performance of a conventional linear search for fixed-length data is generally quite poor. This is a shame, because fixed-length strings are often the only choice when as much data as possible must be kept in memory at once. And fixed- length strings lend themselves perfectly to names and addresses. It should be apparent by now that the best solution for quickly comparing fixed- length string arrays--and the string portion of TYPE arrays too--is with the various Compare functions already shown.

If you are searching for an exact match, then either Compare or Compare2 will be ideal, depending on whether you want to ignore capitalization. If you have only a single string element in each array, you should define a dummy TYPE. This avoids the overhead of having to use both VARSEG and VARPTR as separate arguments. The short example program and SearchType functions that follow search a fixed-length string array for a match.

DEFINT A-Z
DECLARE FUNCTION Compare% (SEG Type1 AS ANY, SEG Type2 AS ANY, NumBytes)
DECLARE FUNCTION Compare2% (SEG Type1 AS ANY, SEG Type2 AS ANY, NumBytes)
DECLARE FUNCTION SearchType% (Array() AS ANY, Sought AS ANY)
DECLARE FUNCTION SearchType2% (Array() AS ANY, Sought AS ANY)
DECLARE FUNCTION SearchType3% (Array() AS ANY, Searched AS ANY)

CLS
TYPE FLen                       'this lets us use SEG
  LastName AS STRING * 15
END TYPE

REDIM Array(1 TO 4000) AS FLen  '4000 is a lot of names
DIM Search AS FLen              'best comparing like data

FOR X = 1 TO 4000 STEP 2        'impart some realism
  Array(X).LastName = "Henderson"
NEXT

Array(4000).LastName = "Henson" 'almost at the end
Search.LastName = "Henson"      'find the same name

'----- first time how long it takes using Compare
Start! = TIMER                  'start timing

FOR X = 1 TO 5                  'search five times
   FoundAt = SearchType%(Array(), Search)
NEXT

IF FoundAt >= 0 THEN
  PRINT "Found at element"; FoundAt
ELSE
  PRINT "Not found"
END IF

Done! = TIMER
PRINT USING "##.## seconds with Compare"; Done! - Start!
PRINT


'----- then time how long it takes using Compare2
Start! = TIMER                  'start timing

FOR X = 1 TO 5                  'as above
   FoundAt = SearchType2%(Array(), Search)
NEXT

IF FoundAt >= 0 THEN
  PRINT "Found at element"; FoundAt
ELSE
  PRINT "Not found"
END IF

Done! = TIMER
PRINT USING "##.## seconds with Compare2"; Done! - Start!
PRINT


'---- finally, time how long it takes using pure BASIC
Start! = TIMER

FOR X = 1 TO 5
   FoundAt = SearchType3%(Array(), Search)
NEXT

IF FoundAt >= 0 THEN
  PRINT "Found at element"; FoundAt
ELSE
  PRINT "Not found"
END IF

Done! = TIMER
PRINT USING "##.## seconds using BASIC"; Done! - Start!
END

FUNCTION SearchType% (Array() AS FLen, Sought AS FLen) STATIC

SearchType% = -1                'assume not found

FOR X = LBOUND(Array) TO UBOUND(Array)
  IF Compare%(Array(X), Sought, LEN(Sought)) THEN
    SearchType% = X             'save where it was found
    EXIT FOR                    'and skip what remains
  END IF
NEXT

END FUNCTION


FUNCTION SearchType2% (Array() AS FLen, Sought AS FLen) STATIC

SearchType2% = -1               'assume not found

FOR X = LBOUND(Array) TO UBOUND(Array)
  IF Compare2%(Array(X), Sought, LEN(Sought)) THEN
    SearchType2% = X            'save where it was found
    EXIT FOR                    'and skip what remains
  END IF
NEXT

END FUNCTION


FUNCTION SearchType3% (Array() AS FLen, Searched AS FLen) STATIC

SearchType3% = -1               'assume not found

FOR X = LBOUND(Array) TO UBOUND(Array)
  IF Array(X).LastName = Searched.LastName THEN
    SearchType3% = X            'save where it was found
    EXIT FOR                    'and skip what remains
  END IF
NEXT
  
END FUNCTION

When you run this program it will be apparent that the SearchType function is the fastest, because it uses Compare which doesn't perform any case conversions. SearchType2 is only slightly slower with that added overhead, and the purely BASIC function, SearchType3, lags far behind at half the speed. Note that the array is searched five times in succession, to minimize the slight errors TIMER imposes. Longer timings are generally more accurate than short ones, because of the 1/18th second resolution of the PC's system timer.

Binary Searches

This is about as far as we can go using linear searching, and to achieve higher performance requires a better algorithm. The Binary Search is one of the fastest available; however, it requires the data to already be in sorted order. A Binary Search can also be used with a sorted index, and both methods will be described.

Binary searches are very fast, and also very simple to understand. Unlike the Quick Sort algorithm which achieves great efficiency at the expense of being complicated, a Binary Search can be written using only a few lines of code. The strategy is to start the search at the middle of the array. If the value of that element value is less than that of the data being sought, a new halfway point is checked and the process repeated. This way, the routine can quickly zero in on the value being searched for. Figure 8-3 below shows how this works.

13:  Zambia
12:  Sweden
11:  Peru
10:  Mexico  <-- step 2
 9:  Holland
 8:  Germany
 7:  Finland <-- step 1
 6:  England
 5:  Denmark
 4:  China
 3:  Canada
 2:  Austria
 1:  Australia

Figure 8-3: How a Binary Search locates data in a sorted array.

If you are searching for Mexico, the first element examined is number 7, which is halfway through the array. Comparing Mexico to Finland shows that Mexico is greater, so the distance is again cut in half. In this case, a match was found after only two tries--remarkably faster than a linear search that would have required ten comparisons. Even when huge arrays must be searched, data can often be found in a dozen or so tries. One interesting property of a binary search is that it takes no longer to find the last element in the array than the first one.

The program below shows one way to implement a Binary Search.

DEFINT A-Z
DECLARE FUNCTION BinarySearch% (Array$(), Find$)

CLS
PRINT "Creating test data..."

REDIM Array$(1 TO 1000)         'create a "sorted" array
FOR X = 1 TO 1000
  Array$(X) = "String " + RIGHT$("000" + LTRIM$(STR$(X)), 4)
NEXT

PRINT "Searching array..."

FoundAt = BinarySearch%(Array$(), "String 0987")
IF FoundAt >= 0 THEN
  PRINT "Found at element"; FoundAt
ELSE
  PRINT "Not found"
END IF

END


FUNCTION BinarySearch% (Array$(), Find$) STATIC

BinarySearch% = -1              'no matching element yet
Min = LBOUND(Array$)            'start at first element
Max = UBOUND(Array$)            'consider through last

DO
  Try = (Max + Min) \ 2         'start testing in middle

  IF Array$(Try) = Find$ THEN   'found it!
    BinarySearch% = Try         'return matching element
    EXIT DO                     'all done
  END IF

  IF Array$(Try) > Find$ THEN   'too high, cut in half
    Max = Try - 1
  ELSE
    Min = Try + 1               'too low, cut other way
  END IF
LOOP WHILE Max >= Min

END FUNCTION

The BinarySearch function returns either the element number where a match was found, or -1 if the search string was not found. Not using a value of zero to indicate failure lets you use arrays that start with element number 0. As you can see, the simplicity of this algorithm belies its incredible efficiency. The only real problem is that the data must already be in sorted order. Also notice that two string comparisons must be made--one to see if the strings are equal, and another to see if the current element is too high. Although you could use Compare3 which examines the strings once and tells if the data is the same or which is greater, a Binary Search is so fast that this probably isn't worth the added trouble. As you will see when you run the test program, it takes far longer to create the data than to search it!

Besides the usual enhancements that can be applied to the comparisons using UCASE$ or MID$, this function could also be structured to use a parallel index array. Assuming the data is not sorted but the index array is, the modified Binary Search would look like this:

FUNCTION BinaryISearch% (Array$(), Index(), Find$) STATIC

BinaryISearch% = -1             'assume not found
Min = LBOUND(Array$)            'start at first element
Max = UBOUND(Array$)            'consider through last

DO
  Try = (Max + Min) \ 2         'start testing in middle

  IF Array$(Index(Try)) = Find$ THEN    'found it!
    BinaryISearch% = Try        'return matching element
    EXIT DO                     'all done
  END IF

  IF Array$(Index(Try)) > Find$ THEN    'too high, cut
    Max = Try - 1
  ELSE
    Min = Try + 1               'too low, cut other way
  END IF
LOOP WHILE Max >= Min

END FUNCTION

Numeric Arrays

All of the searching techniques considered so far have addressed string data. In most cases, string array searches are the ones that will benefit the most from improved techniques. As you have already seen, BASIC makes copies of fixed-length strings before comparing them, which slows down searching. And the very nature of strings implies that many bytes may have to be compared before determining if they are equal or which string is greater. In most cases, searching a numeric array is fast enough without requiring any added effort, especially when the data is integer or long integer.

However, a few aspects of numeric searching are worth mentioning here. One is avoiding the inevitable rounding errors that are sure to creep into the numbers you are examining. Another is that in many cases, you may not be looking for an exact match. For example, you may need to find the first element that is higher than a given value, or perhaps determine the smallest value in an array.

Unlike strings that are either the same or they aren't, the binary representation of numeric values is not always so precise. Consider the following test which *should* result in a match, but doesn't.

Value! = 1!
Result! = 2!
CLS

FOR X = 1 TO 1000
  Value! = Value! + .001
NEXT

IF Value! = Result! THEN
  PRINT "They are equal"
ELSE
  PRINT "Value! ="; Value!
  PRINT "Result! ="; Result!
END IF

After adding .001 to Value! 1000 times Value! should be equal to 2, but instead it is slightly higher. This is because the binary storage method used by computers simply cannot represent every possible value with absolute accuracy. Even changing all of the single precision exclamation points (!) to double precision pound signs (#) will not solve the problem. Therefore, to find a given value in a numeric array can require some extra trickery.

What is really needed is to determine if the numbers are *very close* to each other, as opposed to exactly the same. One way to accomplish this is to subtract the two, and see if the result is very close to zero. This is shown below.

Value! = 1!
Result! = 2!
CLS

FOR X = 1 TO 1000
  Value! = Value! + .001
NEXT

IF ABS(Value! - Result!) < .0001 THEN
  PRINT "They are equal"
ELSE
  PRINT "Value! ="; Value!
  PRINT "Result! ="; Result!
END IF

Here, the absolute value of the difference between the numbers is examined, and if that difference is very small the numbers are assumed to be the same. Unfortunately, the added overhead of subtracting before comparing slows the comparison even further. There is no simple cure for this, and an array search must apply this subtraction to each element that is examined.

Another common use for numeric array searches is when determining the largest or smallest value. Many programmers make the common mistake shown below when trying to find the largest value in an array.

MaxValue# = 0

FOR X = 1 TO NumElements
  IF Array#(X) > MaxValue# THEN
    MaxValue# = Array#(X)
    Element = X
  END IF
NEXT

PRINT "The largest value found is"; MaxValue#
PRINT "And it was found at element"; Element

The problem with this routine is that it doesn't account for arrays where all of the elements are negative numbers! In that case no element will be greater than the initial MaxValue#, and the routine will incorrectly report zero as the result. The correct method is to obtain the lowest element value, and use that as a starting point:

MaxValue# = Array#(1)

FOR X = 2 TO NumElements
  IF Array#(X) > MaxValue# THEN
    MaxValue# = Array#(X)
  END IF
NEXT

PRINT "The largest value found is"; MaxValue#

Determining the highest value in an array would be handled similarly, except the greater-than symbol (>) would be replaced with a less-than operator (<).

Soundex

The final searching technique I will show is Soundex. It is often useful to search for data based on its sound, for example when you do not know how to spell a person's name. Soundex was invented in the 1920's and has been used since then by, among others, the U.S. Census Bureau. A Soundex code is an alpha-numeric representation of the sound of a word, and it is surprisingly accurate despite its simplicity. The classic implementation of Soundex returns a four-character result code. The first character is the same as the first letter of the word, and the other three are numeric digits coded as shown in Figure 8-4.

    1    B, F, P, V
    2    C, G, J, K, Q, S, X
    3    D, T
    4    L
    5    M, N
    6    R

Figure 8-4: The Soundex code numbers returned for significant letters of the alphabet.

Letters not shown are simply skipped as being statistically insignificant to the sound of the word. In particular, speaking accents often minimize the importance of vowels, and blur their distinction. If the string is short and there are fewer than four digits, the result is simply padded with trailing zeros. One additional rule is that a code digit is never repeated, unless there is an uncoded letter in between. In the listing that follows, two different implementations of Soundex are shown.

'SOUNDEX.BAS, Soundex routines and example

DEFINT A-Z

DECLARE FUNCTION ASoundex$ (Word$)
DECLARE FUNCTION ISoundex% (Word$)

CLS
DO
  PRINT "press Enter alone to exit"
  INPUT "What is the first word"; FWord$
  IF LEN(FWord$) = 0 THEN EXIT DO
  INPUT "What is the second word"; SWord$
  PRINT

  'Test by alpha-numeric soundex
  PRINT "Alpha-Numeric Soundex: "; FWord$; " and ";
  PRINT SWord$; " do ";
  IF ASoundex$(FWord$) <> ASoundex$(SWord$) THEN
    PRINT "NOT ";
  END IF
  PRINT "sound the same."
  PRINT

  'Test by numeric soundex
  PRINT "      Numeric Soundex: "; FWord$; " and ";
  PRINT SWord$; " do ";
  IF ISoundex%(FWord$) <> ISoundex%(SWord$) THEN
    PRINT "NOT ";
  END IF
  PRINT "sound the same."
  PRINT
LOOP
END


FUNCTION ASoundex$ (InWord$) STATIC

  Word$ = UCASE$(InWord$)
  Work$ = LEFT$(Word$, 1) + "000"
  WkPos = 2
  PrevCode = 0

  FOR L = 2 TO LEN(Word$)
    Temp = INSTR("BFPVCGJKQSXZDTLMNR", MID$(Word$, L, 1))
    IF Temp THEN
      Temp = ASC(MID$("111122222222334556", Temp, 1))
      IF Temp <> PrevCode THEN
        MID$(Work$, WkPos) = CHR$(Temp)
        PrevCode = Temp
        WkPos = WkPos + 1
        IF WkPos > 4 THEN EXIT FOR
      END IF
    ELSE
      PrevCode = 0
    END IF
  NEXT

  ASoundex$ = Work$

END FUNCTION


FUNCTION ISoundex% (InWord$) STATIC

  Word$ = UCASE$(InWord$)
  Work$ = "0000"
  WkPos = 1
  PrevCode = 0

  FOR L = 1 TO LEN(Word$)
    Temp = INSTR("BFPVCGJKQSXZDTLMNR", MID$(Word$, L, 1))
    IF Temp THEN
      Temp = ASC(MID$("111122222222334556", Temp, 1))
      IF Temp <> PrevCode THEN
        MID$(Work$, WkPos) = CHR$(Temp)
        PrevCode = Temp
        WkPos = WkPos + 1
        IF WkPos > 4 THEN EXIT FOR
      END IF
    ELSE
      PrevCode = 0
    END IF
  NEXT

  ISoundex% = VAL(Work$)

END FUNCTION

The first function, ASoundex, follows the standard Soundex definition and returns the result as a string. The ISoundex version cheats slightly by coding the first letter as a number, but it returns an integer value instead of a string. Because integer searches are many times faster than string searches, this version will be better when thousands--or even hundreds of thousands--of names must be examined.

An additional benefit of the integer-only method is that it allows for variations on the first letter. For example, if you enter Cane and Kane in response to the prompts from SOUNDEX.BAS ASoundex will not recognize the names as sounding alike where ISoundex will.

Linked Data

No discussion of searching and sorting would be complete without a mention of linked lists and other data links. Unlike arrays where all of the elements lie in adjacent memory locations, linked data is useful when data locations may be disjointed. One example is the linked list used by the DOS File Allocation Table (FAT) on every disk. As I described in Chapter 6, the data in each file may be scattered throughout the disk, and only through a linked list can DOS follow the thread from one sector in a file to another.

Another example where linked data is useful--and the one we will focus on here--is to keep track of memo fields in a database. A memo field is a field that can store freeform text such as notes about a sales contact or a patient's medical history. Since these fields typically require varying lengths, it is inefficient to reserve space for the longest one possible in the main database file. Therefore, most programs store memo fields in a separate disk file, and use a *pointer field* in the main data file to show where the corresponding memo starts in the dedicated memo file. Similarly, a back pointer adjacent to each memo identifies the record that points to it. This is shown in Figure 8-5 below.

Figure 8-5: Pointers relate record numbers to memo file offsets and vice versa.

Here, the pointer in the main data file record is a long integer that holds the byte offset into the memo file where the corresponding memo text begins. And just before the memo text is an integer record number that shows which record this memo belongs to. (If you anticipate more than 65,535 records a long integer must be used instead.) Thus, these pointers provide links between the two files, and relate the information they contain.

When a new record is added to the main file, the memo that goes with it is appended to the end of the memo file. BASIC's LOF function can be used to determine the current end of the memo file, which is then used as the beginning offset for the new memo text. And as the new memo is appended to MEMO.DAT, the first data actually written is the number of the new record in the main data file.

The record number back pointer in the memo file is needed to allow memo data to be edited. Since there's no reasonable way to extend memo text when other memo data follows it, most programs simply abandon the old text, and allocate new space at the end of the file. The abandoned text is then marked as such, perhaps by storing a negative value as the record number. Storing a negative version of the abandoned data's length is ideal, because that both identifies the data as obsolete, and also tells how much farther into the file the next memo is located.

The idea here is that you would periodically run a memo file maintenance program that compacts the file, thus eliminating the wasted space the abandoned memos occupy. This is similar to the DBPACK.BAS utility shown in Chapter 7, and also similar to the way that BASIC compacts string memory when it becomes full. But when an existing memo is relocated in the memo file, the field in the main data file that points to the memo must also be updated. And that's why the record number back pointer is needed: so the compaction program can know which record in the main file must be updated.

The "L" identifier in the memo file in Figure 8-5, shown between the record number and memo text, is a length byte or word that tells how long the text is. If you plan to limit the memo field lengths to 255 or fewer characters, then a single byte is sufficient. Otherwise an integer must be used. An example of code that reads a data record and then its associated memo text is shown below.

GET #MainFile, RecNumber, TypeVar
MemoOffset& = TypeVar.MemoOff
GET #MemoFile, MemoOffset& + 2, MemoLength%
Memo$ = SPACE$(MemoLength%)
GET #MemoFile, , Memo$

The first step reads a record from the main data file into a TYPE variable, and the second determines where in the memo file the memo text begins. Two is added to that offset in the second GET statement, to skip over the record number back pointer which isn't needed here. Once the length of the memo text is known, a string is assigned to that length, and the actual text is read into it.

If you are using long integer record numbers you would of course use MemoOffset& + 4 in the second GET. And if you're using a single byte to hold the memo length you would define a fixed-length string to receive that byte:

DIM Temp AS STRING *1
GET #MemoFile, MemoOffset& + 2, Temp
MemoLength = ASC(Temp)

Since BASIC doesn't offer a byte-sized integer data type, ASC and STR$ can be used to convert between numeric and string formats.

Array Element Insertion and Deletion

The last issue related to array and memory manipulation I want to cover is inserting and deleting elements. If you intend to maintain file indexes or other information in memory and in sorted order, you will need some way to insert a new entry. By the same token, deleting an entry in a database requires that the parallel index entry also be deleted.

The most obvious way to insert or delete elements in an array is with a FOR/NEXT loop. The first example below inserts an element, and the second deletes one.

'----- Insert an element:
Element = 200
InsertValue = 999

FOR X = UBOUND(Array) TO Element + 1 STEP -1
  Array(X) = Array(X - 1)
NEXT
Array(Element) = InsertValue


'----- Delete an element:
Element = 200
FOR X = Element TO UBOUND(Array) - 1
  Array(X) = Array(X + 1)
NEXT
Array(UBOUND(Array)) = 0  'optionally clear last element

For integer, long integer, and fixed-length arrays this is about as efficient as you can get, short of rewriting the code in assembly language. However, with floating point and string arrays the performance is less than ideal. Unless a numeric coprocessor is installed, floating point values are assigned using interrupts and support code in the emulator library. This adds an unnecessary level of complication that also impacts the speed. When strings are assigned the situation is even worse, because of the memory allocation overhead associated with dynamic string management.

A better solution for floating point and string arrays is a series of SWAP statements. The short program below benchmarks the speed difference of the two methods, as it inserts an element into a single precision array.

REDIM Array(1 TO 500)
CLS
Element% = 200
InsertValue = 999

Start = TIMER
FOR A% = 1 TO 500
  FOR X% = UBOUND(Array) TO Element% + 1 STEP -1
    Array(X%) = Array(X% - 1)
  NEXT
  Array(Element%) = InsertValue
NEXT
Done = TIMER
PRINT USING "##.## seconds when assigning"; Done - Start

Start = TIMER
FOR A% = 1 TO 500
  FOR X% = UBOUND(Array) TO Element% + 1 STEP -1
    SWAP Array(X%), Array(X% - 1)
  NEXT
  Array(Element%) = InsertValue
NEXT
Done = TIMER
PRINT USING "##.## seconds when swapping"; Done - Start

If you run this program in the BASIC environment, the differences may not appear that significant. But when the program is compiled to an executable file, the swapping method is more than four times faster. In fact, you should never compare programming methods using the BASIC editor for exactly this reason. In many cases, the slowness of the interpreting process overshadows significant differences between one approach and another.

String arrays also benefit greatly from using SWAP instead of assignments, though the amount of benefit varies depending on the length of the strings. If you modify the previous program to use a string array, also add this loop to initialize the elements:

   FOR X% = 1 TO 500
     Array$(X%) = "String number" + STR$(X)
   NEXT

With BASIC PDS far strings the difference is only slightly less at about three to one, due to the added complexity of far data. Also, SWAP will always be worse than assignments when inserting or deleting elements in a fixed-length string or TYPE array. An assignment merely copies the data from one location to another. SWAP, however, must copy the data in both directions.

Understand that when using SWAP with conventional string arrays, the data itself is not exchanged. Rather, the four-byte string descriptors are copied. But because BASIC program modules store string data in different segments, extra work is necessary to determine which descriptor goes with which segment. When near strings are being used, only six bytes are exchanged, regardless of the length of the strings. Four bytes hold the descriptors, and two more store the back pointers.

Summary

This chapter explained many of the finer points of sorting and searching all types of data in BASIC. It began with sorting concepts using the simple Bubble Sort as a model, and then went on to explain indexed and multi-key sorts. One way to implement a multi-key sort is by aligning the key fields into adjacent TYPE components. While there are some restrictions to this method, it is fairly simple to implement and also very fast.

The Quick Sort algorithm was shown, and the SEEQSORT.BAS program on the accompanying disk helps you to understand this complex routine by displaying graphically the progress of the comparisons and exchanges as they are performed. Along the way you saw how a few simple modifications to any string sort routine can be used to sort regardless of capitalization, or based on only a portion of a string element.

You also learned that writing a truly general sort routine that can handle any type of data requires dealing exclusively with segment and address pointers. Here, assembly language routines are invaluable for assisting you when performing the necessary comparisons and data exchanges. Although the actual operation of the assembly language routines will be deferred until Chapter 12, such routines may easily be added to a BASIC program using .LIB and .QLB libraries.

I mentioned briefly the usefulness of packing and aligning data when possible, as an aid to fast sorting. In particular, dates can be packed to only three bytes in Year/Month/Day order, and other data such as zip codes can be stored in long integers. Because numbers can be compared much faster than strings, this helps the sorting routines operate more quickly.

Array searching was also discussed in depth, and both linear and binary search algorithms were shown. As with the sorting routines, searching can also employ UCASE$ and MID$ to search regardless of capitalization, or on only a portion of each array element. Two versions of the Soundex algorithm were given, to let you easily locate names and other data based on how they sound.

Besides showing the more traditional searching methods, I presented routines to determine the minimum and maximum values in a numeric array. I also discussed some of the ramifications involved when searching floating point data, to avoid the inevitable rounding errors that might cause a legitimate match to be ignored.

Finally, some simple ways to insert and delete elements in both string and numeric arrays were shown. Although making direct assignments in a loop is the most obvious way to do this, BASIC's often-overlooked SWAP command can provide a significant improvement in speed.

The next chapter will conclude this section about hands-on programming by showing a variety of program optimization techniques.

�

QB CULT MAGAZINEVol. 2 Iss. 3 - August 2001

BASIC Techniques and Utilities, Chapter 8Sorting and Searching