QB CULT MAGAZINE
Issue #5 - July-October 2000

BASIC Techniques and Utilities, Chapter 1
An Introduction To Compiled BASIC

By Ethan Winer <ethan@ethanwiner.com>

This chapter explores the internal workings of the BASIC compiler. Many people view a compiler simply as a "black box" which magically transforms BASIC source files into executable code. Of course, magic does not play a part in any computer program, and the BC compiler that comes with Microsoft BASIC is no exception. It is merely a program that processes data in the same way any other program would. In this case, the data is your BASIC source code.

You will learn here what the BASIC compiler does, and how it does it. You will also get an inside glimpse at some of the decisions a compiler must make, as it transforms your code into the assembly language commands the CPU will execute. By truly understanding the compiler's role, you will be able to exploit its strengths and also avoid its weaknesses.

Compiler Fundamentals

No matter what language a program is written in, at some point it must be translated into the binary codes that the PC's processor can understand. A single BASIC command can do a great deal of work, but the CPU within every PC is capable of acting on only very rudimentary instructions. Some typical examples of these instructions are "Add 3 to the value stored in memory location 100", and "Compare the value stored at address 4012 to the number -12 and jump to the code at address 2015 if it is less". Therefore, one very important value of a high-level language such as BASIC is that a programmer can use meaningful names instead of memory addresses when referring to variables and subroutines. Another is the ability to perform complex actions that require many separate small steps using only one or two statements.

As an example, when you use the command PRINT X% in a program, the value of X% must first be converted from its native two-byte binary format into an ASCII string suitable for display. Next, the current cursor location must be determined, at which point the characters in the string are placed into the screen's memory area. Further, the cursor position has to be updated, to place it just past the digits that were printed. Finally, if the last digit happened to end up at the bottom-right corner of the screen, the display must also be scrolled up a line. As you can see, that's an awful lot of activity for such a seemingly simple statement!

A compiler, then, is a program that translates these English-like BASIC source statements into the many separate and tiny steps the microprocessor requires. The BASIC compiler has four major responsibilities, as shown in Figure 1-1 below.

  1. Translate BASIC statements into an equivalent series of assembly language commands.
  2. Assign addresses in memory to hold each of the variables being used by the program.
  3. Remember the addresses in the generated code where each line number or label occurs, for GOTO and GOSUB statements.
  4. Generate additional code to test for events and detect errors when the /v, /w, or /d compile options are used.
Figure 1-1: The primary actions performed by a BASIC compiler.

As the compiler processes a program's source code, it translates only the most basic statements directly into assembly language. For other, more complex statements, it instead generates calls to routines in the BASIC run-time library that is supplied with your compiler. When designing a BASIC program you would most likely identify operations that need to be performed more than once, and then create subprograms or functions rather than add the same code in-line repeatedly. Likewise, the compiler takes advantage of the inherent efficiency of using called subroutines.

For example, when you use a BASIC statement such as PRINT Work$, the compiler processes it as if you had used CALL PRINT(Work$). That is, PRINT really is a called subroutine. Similarly, when you write OPEN FileName$ FOR RANDOM AS #1 LEN = 1024, the compiler treats that as a call to its Open routine, and it creates code identical to CALL OPEN(FileName$, 1, 1024, 4). Here, the first argument is the file name, the second is the file number you specified, the third is the record length, and the value 4 is BASIC's internal code for RANDOM. Because these are BASIC key words, the CALL statement is of course not required. But the end result is identical.

While the BC compiler could certainly create code to print the string or open the file directly, that would be much less efficient than using subroutines. Indeed, all of the subroutines in the Microsoft-supplied libraries are written in assembly language for the smallest size and highest possible performance.

Data Storage

The second important job the compiler must perform is to identify all of the variables and other data your program is using, and allocate space for them in the object file. There are two kinds of data that are manipulated in a BASIC program--static data and dynamic data. The term static data refers to any variable whose address and size do not change during the execution of a program. This includes all simple numeric and TYPE variables, and static numeric and TYPE arrays. String constants such as "Press a key to continue" and DATA items are also considered to be static data, since their contents never change.

Dynamic data is that which changes in size or location when the program runs. One example of dynamic data is a dynamic array, because space to hold its contents is allocated when the program runs. Another is string data, which is constantly moved around in memory as new strings are assigned and old ones are erased. Variable and array storage is discussed in depth in Chapter 2, so I won't belabor that now. The goal here is simply to introduce the concept of variable storage. The important point is that BC deals only with static data, because that must be placed into the object file.

As the compiler processes your source code, it must remember each variable that is encountered, and allocate space in the object file to hold it. Further, all of this data must be able to fit into a single 64K segment, which is called DGROUP (for Data Group). Although the compiled code in each object file may be as large as 64K, static data is combined from all of the files in a multi-module program, and may not exceed 64K in total size. Note that this limitation is inherent in the design of the Intel microprocessors, and has nothing to do with BC, LINK, or DOS.

As each new variable is encountered, room to hold it is placed into the next available data address in the object file. (In truth, the compiler retains all variable information in memory, and writes it to the end of the file all at once following the generated code.) For each integer variable, two bytes are set aside. Long integer and single precision variables require four bytes each, while double precision variables occupy eight bytes. Fixed-length string and TYPE variables use a varying number of bytes, depending on the components you have defined.
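
The sizes given above can be checked from within a program, because LEN accepts a variable of any type and reports the number of bytes it occupies. The following is only a minimal sketch; the TYPE and all of the variable names are made up for illustration.

```basic
DEFINT A-Z

TYPE Customer                   'hypothetical TYPE, for illustration only
  Id AS INTEGER                 ' 2 bytes
  Balance AS DOUBLE             ' 8 bytes
  LastName AS STRING * 20       '20 bytes
END TYPE

DIM Cust AS Customer
DIM PartNo AS STRING * 12

PRINT LEN(Count%)               '2
PRINT LEN(Total&)               '4
PRINT LEN(Rate!)                '4
PRINT LEN(Amount#)              '8
PRINT LEN(PartNo)               '12
PRINT LEN(Cust)                 '30, the sum of the three components
```

Note that LEN on a conventional (not fixed-length) string reports the current number of characters instead, which is one hint that such strings are stored differently.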

Static numeric and TYPE arrays are also written to the object file by the compiler. The number of bytes that are written of course depends on how many elements have been specified in the DIM statement. Also, notice that no matter what type of variable or array is encountered, only zeroes are written to the file. The only exceptions are quoted string constants and DATA items, in which case the actual text must be stored.

Conventional (variable-length) strings must be handled somewhat differently from numeric, TYPE, and fixed-length string variables. For each string variable a program uses, a four-byte table called a *string descriptor* is placed into the object file. However, since the actual string data is not assigned until the program is run, space for that data need not be handled by the compiler. With string arrays--whether static or dynamic--a table of four-byte descriptors is allocated.

Finally, each array in the program also requires an array descriptor. This is simply a table that shows where the array's data is located in memory, how many elements it currently holds, the length in bytes of each element, and so forth.

Assembly Language Considerations

In order to fully appreciate how the translation process operates, you will first need to understand what assembly language is all about. Please understand that there is nothing inherently difficult about assembly language. Like BASIC, assembly language is composed of individual instructions that are executed in sequence. However, each of these instructions does much less than a typical BASIC statement. Therefore, many more steps are required to achieve a given result than in a high-level language. Some of these steps will be shown in the following examples. If you are not comfortable with the idea of tackling assembly language concepts just yet, please feel free to come back to this section at a later time.

Let's begin by examining some very simple BASIC statements, and see how they are translated by the compiler. For simplicity, I will show only integer math operations. The 80x86 family of microprocessors can manipulate integer values directly, as opposed to single and double precision numbers which are much more complex. The short code fragment in Listing 1-1 shows some very simple BASIC instructions, along with the resulting compiled assembly code. In case you are interested, disassemblies such as those you are about to see are easy to create for yourself using the Microsoft CodeView utility. CodeView is included with the Macro Assembler as well as with BASIC PDS.

A% = 12
   MOV  WORD PTR [A%],12    ;move a 12 into the word variable A%

X% = X% + 1
   INC  WORD PTR [X%]       ;add 1 to the word variable X%

Y% = Y% + 100
   ADD  WORD PTR [Y%],100   ;add 100 to the word variable Y%

Z% = A% + B%
   MOV  AX,WORD PTR [B%]    ;move the contents of B% into AX
   ADD  AX,WORD PTR [A%]    ;add to that the value of A%
   MOV  WORD PTR [Z%],AX    ;move the result into Z%
Listing 1-1: These short examples show the compiled results of some simple BASIC math operations.

The first statement, A% = 12, is directly translated to its assembler equivalent. Here, the value 12 is *moved* into the word-sized address named A%. Although an integer is the smallest data type supported by BASIC, the microprocessor can in fact deal with variables as small as one byte. Therefore, the WORD PTR (word pointer) argument is needed to specify that A% is a full two-byte integer, rather than a single byte. Notice that in assembly language, brackets are used to specify the contents of a memory address. This is not unlike BASIC's PEEK() function, where parentheses are used for that purpose.

In the second statement, X% = X% + 1, the compiler generates assembly language code to increment, or add 1 to, the word-sized variable in the location named X%. Since adding or subtracting a value of 1 is such a common operation in all programming languages, the designers of the 80x86 included the INC (and complementary DEC) instruction to handle that.

Y% = Y% + 100 is similarly translated, but in this case to assembler code that adds the value 100 to the word-sized variable at address Y%. As you can see, the simple BASIC statements shown thus far have a direct assembly language equivalent. Therefore, the code that BC creates is extremely efficient, and in fact could not be improved upon even by a human hand-coding those statements in assembly language.

The last statement, Z% = A% + B%, is only slightly more complicated than the others. This is because separate steps are required to retrieve the contents of one memory location, before manipulating it and assigning the result to another location. Here, the value held in variable B% is moved into one of the processor's registers (AX). The value of variable A% is then added to AX, and finally the result is moved into Z%. There are about a dozen registers within the CPU, and you can think of them as special variables that can be accessed very quickly.

The next example in Listing 1-2 shows how BASIC passes arguments to its internal routines, in this case PRINT and OPEN. Whenever a variable is passed to a routine, what is actually sent is the address (memory location) of the variable. This way, the routine can go to that address, and read the value that is stored there. As in Listing 1-1, the BASIC source code is shown along with the resultant compiler-generated assembler instructions.

It may also be worth mentioning that the order in which the arguments are sent to these routines is determined by how the routines are designed. In BASIC, if a SUB is designed to accept, say, three parameters in a certain order, then the caller must pass its arguments in that same order. Parameters in assembler routines are handled in exactly the same manner. Of course, any arbitrary order could be used, and what's important is simply that they match.

PRINT Work$
    MOV  AX,OFFSET Work$     ;move the address of Work$ into AX
    PUSH AX                  ;push that onto the CPU stack
    CALL B$PESD              ;call the string printing routine

OPEN FileName$ FOR OUTPUT AS #1
    MOV  AX,OFFSET FileName$ ;load the address of FileName$
    PUSH AX                  ;push that onto the stack
    MOV  AX,1                ;load the specified file number
    PUSH AX                  ;and push that as well
    MOV  AX,-1               ;-1 means that a LEN= was not given
    PUSH AX                  ;and push that
    MOV  AX,2                ;2 is the internal code for OUTPUT
    PUSH AX                  ;pass that on too
    CALL B$OPEN              ;finally, call the OPEN routine
Listing 1-2: Many BASIC statements create assembler code that passes arguments to internal routines, as shown above.

When you tell BASIC to print a string, it first loads the address of the string into AX, and then pushes that onto the stack. The stack is a special area in memory that all programs can access, and it is often used in compiled languages to hold the arguments being sent to subroutines. In this case, the OFFSET operator specifies that the address where the variable resides is to be loaded, as opposed to the current contents of the variable. Notice that the words offset, address, and memory location all mean the same thing. Also notice that calls in assembly language work exactly the same as calls in BASIC. When the called routine has finished, execution in the main program resumes with the next statement in sequence.

Once the address for Work$ has been pushed, BASIC's B$PESD routine is called. Internally, one of the first things that B$PESD does is to retrieve the incoming address from the stack. This way it can locate the characters that are to be printed. B$PESD is responsible for printing strings, and other BASIC library routines are provided to print each type of data such as integers and single precision values.

In case you are interested, PESD stands for Print End-of-line String Descriptor. Had a semicolon been used in the print statement--that is, PRINT Work$;--then B$PSSD would have been called instead (Print Semicolon String Descriptor). Likewise, printing a 4-byte long integer with a trailing comma as in PRINT Value&, would result in a call to B$PCI4 (Print Comma Integer 4), where the 4 indicates the integer's size in bytes.

In the second example of Listing 1-2 the OPEN routine is set up and called in a similar fashion, except that four parameters are required instead of only one. Again, each parameter is pushed onto the stack in turn, followed by a call to the routine. Most of BASIC's internal routines begin with the characters "B$", to avoid a conflict with subroutines of your own. Since a dollar sign is illegal in a BASIC procedure name, there is no chance that you will inadvertently choose one of the same names that BASIC uses.

As you can see, there is nothing mysterious or even difficult about assembly language, or the translations performed by the BASIC compiler. However, a sequence of many small steps is often needed to perform even simple calculations and assignments. We will discuss assembly language in much greater depth in Chapter 12, and my purpose here is merely to present the underlying concepts.

Compiler Directives

As you have seen, some code is translated by the compiler into the equivalent assembly language statements, while other code is instead converted to calls to the language routines in the BASIC libraries. Some statements, however, are not translated at all. Rather, they are known as *compiler directives* that merely provide information to the compiler as it works. Some examples of these non-executable BASIC statements include DEFINT, OPTION BASE, and REM, as well as the various "metacommands" such as '$INCLUDE and '$DYNAMIC. Some others are SHARED, BYVAL, DATA, DECLARE, CONST, and TYPE.

For our purposes here, it is important to understand that DIM when used on a static array is also a non-executable statement. Because the size of the array is known when the program is compiled, BC can simply set aside memory in the object file to hold the array contents. Therefore, code does not need to be generated to actually create the array. Similarly, TYPE/END TYPE statements also merely define a given number of bytes that will ultimately end up in the program file when the TYPE variable is later dimensioned by your program.
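
The difference between a DIM that needs no code and one that does can be seen in a short fragment. This is only a sketch with made-up names, and it assumes the usual QuickBASIC rule that an array dimensioned with constant bounds in the main program is static, while one dimensioned with variable bounds is dynamic.

```basic
DEFINT A-Z                      'directive: generates no code
CONST MaxItems = 100            'directive: generates no code

DIM Totals(1 TO MaxItems)       'static array: the bounds are constants, so
                                '  BC can simply reserve the space

NumItems = 100
DIM Scratch(1 TO NumItems)      'dynamic array: the bounds are not known
                                '  until run time, so code must be executed
                                '  to allocate the memory

REDIM Scratch(1 TO 200)         'allowed only because Scratch is dynamic
ERASE Totals                    'static: the elements are reset to zero
ERASE Scratch                   'dynamic: the memory is released
```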

Event and Error Checking

The last compiler responsibility I will discuss here is the generation of additional code to test for events and debugging errors. This occurs whenever a program is compiled using the /d, /w, or /v command line switches. Although event trapping and debugging are entirely separate issues, they are handled in a similar manner. Let's start with event trapping.

When the IBM PC was first introduced, the ability to handle interrupt-driven events distinguished it from its then-current Apple and Commodore counterparts. Interrupts can provide an enormous advantage over polling methods, since polling requires a program to check constantly for, say, keyboard or communications activity. With polling, a program must periodically examine the keyboard using INKEY$, to determine if a key was pressed. But when interrupts are used, the program can simply go about its business, confident that any keystrokes will be processed. Here's how that works:

Each time a key is pressed on a PC, the keyboard generates a hardware interrupt that suspends whatever is currently happening and then calls a routine in the ROM BIOS. That routine in turn reads the character from the keyboard's output port, places it into the PC's keyboard buffer, and returns to the interrupted application. The next time a program looks for a keystroke, that key is already waiting to be read. For example, a program could begin writing a huge multi-megabyte disk file, and any keystrokes will still be handled even if the operator continues to type.
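
The effect of the keyboard buffer is easy to observe. In the sketch below, a delay loop stands in for a long task that never looks at the keyboard, and any keys typed in the meantime are still waiting when the program finally asks for them. The three-second delay and the variable names are arbitrary choices.

```basic
DEFINT A-Z
PRINT "Working for three seconds; type a few keys now..."

Start! = TIMER
DO                              'a stand-in for a long task that never
LOOP UNTIL TIMER - Start! >= 3  '  looks at the keyboard (this simple test
                                '  ignores the midnight rollover of TIMER)

PRINT "Keys typed during the task: ";
DO
  K$ = INKEY$                   'returns "" once the buffer is empty
  PRINT K$;
LOOP WHILE LEN(K$)
PRINT
```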

Understand that hardware interrupts are made possible by a direct physical connection between the keyboard circuitry and the PC's microprocessor. The use of interrupts is a powerful concept, and one which is important to understand. Unfortunately, BASIC does not use interrupts in most cases, and this discussion is presented solely in the interest of completeness.

Event Trapping

BASIC provides a number of event handling statements that perhaps *could* be handled via interrupts, but aren't. When you use ON TIMER, for example, code is added to periodically call a central event handler to check if the number of seconds specified has elapsed. Because there are so many possible event traps that could be active at one time, it would be unreasonable to expect BASIC to set up separate interrupts to handle each possibility. In some situations, such as ON KEY, there is a corresponding interrupt, in this case the keyboard interrupt. However, some events such as ON PLAY(Count), where a GOSUB is made whenever the PLAY buffer has fewer than Count characters remaining, have no corresponding physical interrupt. Therefore, polling for that condition is the only reasonable method.

The example in Listing 1-3 shows what happens when you compile using the /v switch. Notice that the calls to B$EVCK (Event Check) are not part of the original source code. Rather, they show the additional code that BC places just before each program statement.

DEFINT A-Z
    CALL B$EVCK              'this call is generated by BC
ON TIMER(1) GOSUB HandleTime
    CALL B$EVCK              'this call is generated by BC
TIMER ON
    CALL B$EVCK              'this call is generated by BC
X = 10
    CALL B$EVCK              'this call is generated by BC
Y = 100
    CALL B$EVCK              'this call is generated by BC
END

HandleTime:
    CALL B$EVCK              'this call is generated by BC
BEEP
    CALL B$EVCK              'this call is generated by BC
RETURN
Listing 1-3: When the /v compiler switch is used, BC generates calls to a central event handler at each BASIC statement.

At five bytes per call, you can see that using /v can quickly bloat a program to an unacceptable size. One alternative is to instead use /w. In fact, /w can be particularly attractive in those cases where event handling cannot be avoided, because it lets you specify where a call to B$EVCK is made: at each line label or line number in your source code. The only downside to using line numbers and labels is that additional working memory is needed by BC to remember the addresses in the code where those labels are placed. This is not usually a problem, though, unless the program is very large or every line is labeled.
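
As a sketch of how a program might be arranged for /w, the fragment below places a single label inside its main loop, so that the event test is made once per pass rather than once per statement. The label and variable names are hypothetical.

```basic
DEFINT A-Z
ON TIMER(1) GOSUB HandleTime
TIMER ON

DO
CheckHere:                      'with /w, events are tested at this label,
  Count& = Count& + 1           '  once per pass through the loop
LOOP UNTIL LEN(INKEY$)          'press any key to end
TIMER OFF
PRINT Count&; "passes completed"
END

HandleTime:                     'a label, so /w adds a test here as well
  BEEP
RETURN
```

The trade-off is responsiveness: an event that occurs in a long stretch of code with no labels will not be noticed until the next label is reached.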

All of the various BASIC event handling commands are specified using the ON statement. It is important to understand, however, that ON GOTO and ON GOSUB do not involve events. That is, they are really just an alternate form of GOTO and GOSUB respectively, and thus do not require compiling with /w or /v.
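
For comparison, here is a small example of ON GOSUB. It merely selects one of several subroutines based on the value of a variable, so no event checking is involved and neither /v nor /w is needed. The label names are made up.

```basic
DEFINT A-Z
INPUT "Choice (1-3): ", Choice
ON Choice GOSUB AddItem, EditItem, QuitNow   'branch on the value of Choice
END

AddItem:
  PRINT "Add"
RETURN

EditItem:
  PRINT "Edit"
RETURN

QuitNow:
  PRINT "Quit"
RETURN
```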

Error Trapping

The last compiler option to consider here is the /d switch, because it too generates extra code that you might not otherwise be aware of. When a program is compiled with /d, several things are added. First, for every BASIC statement a call is made to a routine named B$LINA, which merely checks to see if Ctrl-Break has been pressed. Normally, a compiled BASIC program is immune to pressing the Ctrl-C and Ctrl-Break keys, except during an INPUT or LINE INPUT statement. Since much of the purpose of a debugging mode is to let you break out of an errant program gone berserk, the Ctrl-Break checking must be performed frequently. These checks are handled in much the same way as event trapping, by calling a special routine once for each line in your source code.

Another important factor resulting from the use of /d is that all array references are handled through a special called routine which ensures that the element number specified is in fact legal. Many people don't realize this, but when a program is compiled without /d and an invalid element is given, BASIC will blindly write to the wrong memory locations. For example, if you use DIM Array%(1 TO 100) and then attempt to assign, say, element number 200, BASIC is glad to oblige. Of course, there *is* no element 200 in that case, and some other data will no doubt be overwritten in the process.

To prevent these errors from going undetected, BC calls the B$HARY (Huge Array) routine to calculate the address based on the element number specified. If B$HARY determines that the array reference is out of bounds, it invokes an internal error handler and you receive the familiar "Subscript out of range" message. Normally, the compiler accesses array elements using as little code as possible, to achieve the highest possible performance. If a static array is dimensioned to 100 elements and you assign element 10, BC knows at the time it compiles your program the address at which that element resides. It can therefore access that element directly, just as if it were a non-array variable.

Even when you use a variable to specify an array element such as Array%(X) = 12, the starting address of the array is known, and the value in X can be used to quickly calculate how far into the array that element is located. Therefore, the lack of bounds checking in programs that do not use /d is not a bug in BASIC. Rather, it is merely a trade-off to obtain very high performance. Indeed, one of the primary purposes of using /d is to let BC find mistakes in your programs during development, though at the cost of execution speed.
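
When the final program will be compiled without /d, an element number that comes from an untrusted source (a file or the keyboard, say) can be tested by hand at just that one place. The following is a minimal sketch of the idea, with the dangerous assignment left as a comment.

```basic
DEFINT A-Z
DIM Array(1 TO 100)

Index = 200                     'deliberately out of range

'Array(Index) = 12              'with /d: "Subscript out of range"
                                'without /d: some other memory is overwritten

IF Index >= LBOUND(Array) AND Index <= UBOUND(Array) THEN
  Array(Index) = 12
ELSE
  PRINT "Element"; Index; "is outside the array"
END IF
```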

The biggest complication from BASIC's point of view is when huge (greater than 64K) arrays are being manipulated. In fact, B$HARY is the very same routine that BC calls when you use the /ah switch to specify huge arrays (hence the name HARY). Since extra code is needed to set up and call B$HARY compared to the normal array access, using /ah also creates programs that are larger and slower than when it is not used. Further, because B$HARY is used by both /d and /ah, invalid element accesses will also be trapped when you compile using /ah.

Overflow Errors

The final result of using /d is that extra code is generated after certain math operations, to check for overflow errors that might otherwise go undetected. Overflow errors are those that result in a value too large for a given data type. For example, if you multiply two integers and the result exceeds 32767, that causes an overflow error. Similarly, an underflow error would be created by a calculation resulting in a value that is too small.

When a floating point math operation is performed, errors that result from overflow are detected by the routines that perform the calculation. When that happens there is no recourse other than halting your program with an appropriate message. Integer operations, however, are handled directly by 80x86 instructions. Further, an out of bounds result is not necessarily illegal to the CPU. Thus, programs compiled without the /d option can produce erroneous results, and without any indication that an error occurred.

To prove this to yourself, compile and run the short program shown in Listing 1-4, but without using /d. Although the correct result should be 90000, the answer that is actually displayed is 24464. And you will notice that no error message is displayed! As with illegal array references, BC would rather optimize for speed, and give you the option of using /d as an aid for tracking down such errors as they occur. If you compile the program in Listing 1-4 with the /d option, then BASIC will report the error as expected.

Since an overflow resulting from integer operations is not technically an error as far as the CPU is concerned, how, then, can BASIC trap for that? Although an error in the usual sense is not created, there is a special flag variable within the CPU that is set whenever such a condition occurs. Further, a little-used assembler instruction, INTO (Interrupt 4 if Overflow), will generate software Interrupt 4 if that flag is set. Therefore, all BC has to do is create an Interrupt 4 handler, and then place an INTO instruction after every integer math operation in the compiled code. The interrupt handler will receive control and display an "Overflow" message whenever an INTO calls it. Since the INTO instruction is only one byte and is also very fast, using it this way results in very little size or performance degradation.

X% = 30000
Y% = X% * 3
PRINT Y%
Listing 1-4: This brief program illustrates how overflow errors are handled in BASIC.
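
The value 24464 is not arbitrary. The true product, 90000, does not fit in the 16 bits of an integer, and what remains once the excess is discarded is 90000 - 65536 = 24464. One way to avoid the problem altogether, sketched below with illustrative variable names, is to force the multiplication to be done using long integers.

```basic
DEFINT A-Z
X = 30000
Y& = X * 3&                     '3& is a long integer constant, so the
                                '  multiplication is done in long math
PRINT Y&                        '90000
PRINT Y& - 65536&               '24464, the value the integer version shows
```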

Compiler Optimization

Designing a compiler for a language as complex as BASIC involves some very tricky programming indeed. Although it is one thing to translate a BASIC source file into a series of assembly language commands, it is another matter entirely to do it well! Consider that the compiler must be able to accept a BASIC statement such as X! = ABS(SQR((Y# + Z!) ^ VAL(Work$))), and reduce that to the individual steps necessary to arrive at the correct result.

Many, many details must be accounted for and handled, not the least of which are syntax or other errors in the source code. Moreover, there are an infinite number of ways that a programmer can accomplish the same thing. Therefore, the compiler must be able to recognize many different programming patterns, and substitute efficient blocks of assembler code whenever it can. This is the role of an *optimizing compiler*.

One important type of optimization is called *constant folding*. This means that as much math as possible is performed during compilation, rather than creating code to do that when the program runs. For example, if you have a statement such as X = 4 * Y * 3, BC can, and does, change that to X = Y * 12. After all, why multiply 3 times 4 later, when the answer can be determined now? This substitution is performed entirely by the BC compiler, without your knowing about it.
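
One practical consequence is that constants can be written in whatever form is clearest to read. The sketch below uses hypothetical names for the dimensions of a text screen. The CONST expression is evaluated during compilation, and, assuming the folding described above applies here as well, the constant part of the second calculation should cost nothing extra at run time.

```basic
DEFINT A-Z
CONST Cols = 80, BytesPerChar = 2      'hypothetical names
CONST RowBytes = Cols * BytesPerChar   'worked out by the compiler: 160

Row = 5
Offset = (Row - 1) * Cols * BytesPerChar   'the constant part can be folded
PRINT RowBytes, Offset                     '160 and 640
```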

Another important type of optimization is BASIC's ability to remember calculations it has already performed, and use the results again later if possible. BC is especially brilliant in this regard, and it can look ahead many lines in your source code for a repeated use of the same calculations. Listing 1-5 shows a short fragment of BASIC source code, along with the resultant assembler output.

X% = 3 * Y% * 4
    MOV  AX,12               ;move the value 12 into AX
    IMUL WORD PTR [Y%]       ;Integer-Multiply that times Y%
    MOV  WORD PTR [X%],AX    ;assign the result in AX to X%

A% = S% * 100
    MOV  BX,AX               ;save the result from above in BX
    MOV  AX,100              ;then assign AX to 100
    IMUL WORD PTR [S%]       ;now multiply AX times S%
    MOV  WORD PTR [A%],AX    ;and assign A% from the result

Z% = Y% * 12
    MOV  WORD PTR [Z%],BX    ;assign Z% from the earlier result
Listing 1-5: These short code fragments illustrate how adept BC is at reusing the result of earlier calculations already performed.

As you can see in the first part of Listing 1-5, the value of 3 times 4 was resolved to 12 by the compiler. Code was then generated to multiply the 12 times Y%, and the result is in turn assigned to X%. This is similar to the compiled code examined earlier in Listing 1-1. Notice, however, that before the second multiplication of S% is performed, the result currently in AX is saved in the BX register. Although AX is destroyed by the subsequent multiplication of S% times 100, the result that was saved earlier in BX can be used to assign Z% later on. Also notice that even though 3 * 4 was used first, BC was smart enough to realize that this is the same as the 12 used later.

While the compiler can actually look ahead in your source code as it works, such optimization will be thwarted by the presence of line numbers and labels, as well as IF blocks. Since a GOTO or GOSUB could jump to a labeled source line from anywhere in the program, there is no way for BC to be sure that earlier statements were executed in sequence. Likewise, the compiler has no way to know which path in an IF/ELSE block will be taken at run time, and thus cannot optimize across those statements.
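
The fragment below sketches the difference. It is an illustration of the rule just described rather than a disassembly, so what BC actually generates for these particular lines is an assumption. The last part shows that saving a result in a variable yourself works regardless of labels.

```basic
DEFINT A-Z
Y = 5

'Straight-line code: nothing can jump in between these two statements
A = Y * 12
B = Y * 12                      'BC is free to reuse the earlier product

'A label in between: a GOTO or GOSUB elsewhere could land on it
C = Y * 12
Again:
D = Y * 12                      'the product has to be calculated again

'Doing the reuse by hand
Product = Y * 12
E = Product
F = Product
```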

The BASIC Run-time Libraries

Microsoft compiled BASIC lets you create two fundamentally different types of programs. Those that are entirely self-contained in one .EXE file are compiled with the /o command line switch. In this case, the compiler creates translations such as those we have already discussed, and also generates calls to the BASIC language routines contained in the library files supplied by Microsoft. When your compiled program is subsequently linked, only those routines that are actually used will be added to your program.

When /o is not used, a completely different method is employed. In this case, a special .EXE file that contains support for every BASIC statement is loaded along with the BASIC program when the program is run from the DOS command line. As you are about to see, there are advantages and disadvantages to each method. For the purpose of this discussion I will refer to stand-alone programs as BCOM programs, after the BCOMxx.LIB library name used in all versions of QuickBASIC. Programs that instead require the BRUNxx.EXE file to be present at run time will be called BRUN programs.

Beginning with BASIC 7 PDS, the library naming conventions used by Microsoft have become more obscure. This is because PDS includes a number of variations for each method, depending on the type of "math package" that is specified when compiling and whether you are compiling a program to run under DOS or OS/2. These variations will be discussed fully in Chapter 5, when we examine all of the possible options that each compiler version has to offer. But for now, we will consider only the two basic methods--BCOM and BRUN. The primary differences between these two types of programs are shown in Figure 1-2.

  1. BCOM programs require less memory, run faster, and do not require the presence of the BRUNxx.EXE file when the program is run.
  2. BRUN programs occupy less disk space, and also allow subsequent chaining to other programs that can share the common library code which is already resident. Chained-to programs also load quickly because the BRUN library is already in memory.
Figure 1-2: A comparison of the fundamental differences between BCOM and BRUN programs.

Stand-alone BCOM programs are always larger than equivalent BRUN programs because the library code for PRINT, INSTR, and so forth is included in the final .EXE file. However, less *memory* will be required when the program runs, since only the code that is really needed is loaded into the PC. Likewise, a BRUN program will take less disk space, because it contains only the compiled code. The actual routines to handle each BASIC statement are stored in the BRUNxx.EXE support file, and that file is loaded automatically when the main program is run from DOS.

You might think that since a BRUN program is physically smaller on disk it will load faster, but this is not necessarily true. When you execute a BRUN program from the DOS command line, one of the first things it does is load the BRUN .EXE support file. Since this support file is fairly large, the overall load time will be much greater than the compiled BASIC program's file size would indicate. However, if the main program subsequently chains to another BASIC program, that program will load quickly because the BRUN file does not need to be loaded a second time.
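
As a minimal sketch of chaining, consider the two short programs below. The file names MENU.BAS and REPORT.BAS and the variable Total% are hypothetical, and I am assuming both programs are compiled without the /o switch so that they share the same BRUN support file.

```basic
' MENU.BAS -- hypothetical first program, compiled without /o
COMMON Total%            ' COMMON variables are handed to the chained-to program

Total% = 100
CHAIN "REPORT"           ' run REPORT.EXE; the BRUN support file is already loaded
```

```basic
' REPORT.BAS -- hypothetical chained-to program, compiled without /o
COMMON Total%            ' must match the COMMON list in MENU.BAS

PRINT "Total passed from MENU:"; Total%
```

When MENU is started from DOS it pays the cost of loading the support file once. REPORT then needs only its own compiled code to be read from disk.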

One other important difference between these two methods is the way that the BASIC language routines are accessed. When a BCOM program is compiled and linked, the necessary routines are called in the usual fashion. That is, the compiler generates code that calls the routines in the BCOM library directly. When the program is subsequently linked, the procedure names are translated by LINK into the equivalent memory addresses. That is, a call to PRINT is in effect translated from CALL B$PESD to CALL ####:####, where ####:#### is a segment and address.

BRUN programs, on the other hand, use a system of interrupts to access the BASIC language routines. Since there is no way for LINK to know exactly where in memory the BRUNxx.EXE file will ultimately be loaded, the interrupt vector table located in low memory is used to hold the various routine addresses. Although many of these interrupt entries are used by the PC's system resources, many others are available. Again, I will defer a thorough treatment of call methods and interrupts until Chapter 11. But for now, suffice it to say that a direct call is slightly faster than an indirect call, where the address to be called must first be retrieved from a table.

As an interesting aside, the routines in the BRUNxx.EXE file in fact modify the caller's code to perform a direct call, rather than an interrupt instruction. Therefore, the first time a given block of code is executed, it calls the run-time routines through an interrupt instruction. Thereafter, the address where the BRUN file has been loaded is known, and will be used the next time that same block of code is executed. In practice, however, this improves only code that lies within a FOR/NEXT, WHILE, or DO loop. Further, code that is executed only once will actually be much slower than in a BCOM program, because of the added self-modification (the program changes itself) instructions.

Notice that when BC compiles your program, it places the name of the appropriate library into the object file. The name BC uses depends on which compiler options were given. This way you don't have to specify the correct name manually, and LINK can read that name and act accordingly. Although QuickBASIC provides only two libraries--one for BCOM programs and one for BRUN--BASIC PDS offers a number of additional options. Each of these options requires the program to be linked with a different library. That is, there are both BRUN and BCOM libraries for use with OS/2, for near and far strings, and for using IEEE or Microsoft's alternate math libraries. Yet another library is provided for 8087-only operation.

Granularity

Until now, we have examined only the actions and methods used by the BC compiler. However, the process of creating an .EXE file that can be run from the DOS command line is not complete until the compiled object file has been linked to the BASIC libraries. I stated earlier that when a stand-alone program is created using the /o switch, only those routines in the BCOM library that are actually needed will be added to the program. Unfortunately, that is not entirely accurate. While it is true that LINK is very smart and will bring in only those routines that are actually called, there is one catch.

Imagine that you have written a BASIC program that is made up of two separate modules. In one file is the main program that contains only in-line code, and in the other are two BASIC subprograms. Even if the main program calls only one of those subprograms, both will be added when the program is linked. That is, LINK can resolve routines to the source file level only, but cannot extract a single routine from an object module which contains multiple routines. Since a .LIB library file is merely a collection of separate object modules, all of the routines that reside in a given module will be added to a program, even if only one has been accessed. This property is called *granularity*, and it determines how finely LINK can remove routines from a library.
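
The two-module arrangement just described might look like the following sketch. The file names MAIN.BAS and SUBS.BAS and the subprogram names are hypothetical, chosen only to illustrate the point.

```basic
' MAIN.BAS -- hypothetical main module containing only in-line code
DECLARE SUB ShowTitle ()

CALL ShowTitle           ' the only subprogram this module ever calls
END
```

```basic
' SUBS.BAS -- hypothetical second module holding both subprograms
SUB ShowTitle
  PRINT "Granularity demo"
END SUB

SUB ShowFooter           ' never called by MAIN.BAS
  PRINT "End of report"
END SUB
```

Once each file has been compiled to its own object module and the two are linked, the code for ShowFooter ends up in the .EXE file as well, because it lives in the same object module as ShowTitle. Moving ShowFooter into a third source file would let LINK leave it out whenever it is not needed.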

In the case of the libraries supplied with BASIC, the determining factor is which assembly language routines were combined with which other routines in the same source file by the programmers at Microsoft. In QuickBASIC 4.5, for example, when a program uses the CLS statement, the routines that handle COLOR, CSRLIN, POS(0), LOCATE, and the function form of SCREEN are also added. This is true even if none of those other statements have been used. Fortunately, Microsoft has done much to improve this situation in BASIC PDS, but there is still room for improvement. In BASIC PDS, CLS is stored in a separate file; however, POS(0), CSRLIN, and SCREEN are still together, as are COLOR and LOCATE.

Obviously, Microsoft has its reasons for doing what it does, and I won't attempt to second-guess their expertise here. The BASIC language libraries are extremely complex and contain many routines. (The QuickBASIC 4.5 BCOM45.LIB file contains 1,485 separate assembler procedures.) With such an enormous number of assembly language source files to deal with, it no doubt makes a lot of sense to organize the related routines together. But it is worth mentioning that Crescent Software's P.D.Q. library can replace much of the functionality of the BCOM libraries, and with complete granularity. In fact, P.D.Q. can create working .EXE programs from BASIC source that are less than 800 bytes in size.

Summary

In this chapter, you learned about the process of compiling, and the kinds of decisions a sophisticated compiler such as Microsoft BASIC must make. In some cases, the BASIC compiler performs a direct translation of your BASIC source code into assembly language, and in others it creates calls to existing routines in the BCOM libraries. Besides creating the actual assembler code, BASIC must also allocate space for all of the data used in a program.

You also learned some basics about assembly language, which will be covered in more detail in Chapter 12. However, upcoming chapters will also use brief assembly language examples to show the relative efficiency of different coding styles. In Chapter 2, you will learn how variables and other data are stored in memory.