C-manship: The Mystery of Compile and Link

BY CLAYTON WALNUM

A couple of months ago, someone came up to me at a users' group meeting and suggested a topic for C-manship. He told me that, though he had no problem getting the example programs presented in C-manship up and running, he was still confused about what actually goes on during a compilation and link. He also was confused about the different types of files we must manipulate when programming in C, specifically .O and .H files.

As I thought about what this person had told me, I realized that, though we discussed compilation and linking briefly when C-manship first started, we never did actually explore the process in detail. This month we're going to make up for that lack. We're going to find out exactly what happens during a compilation and link, and discuss the differences between the various files that we use during this process.

Stating the obvious

There is one thing that we all have to know before we can go any further with this topic. To some, what I'm about to say may be an obvious fact; to others it may come as a revelation. But whatever group you may fall into, this fact is essential in understanding how your C compiler actually works.

Fact: Every computer understands only one language: machine language. And every program, no matter what language it's written in, must sooner or later be reduced to machine language.

Of course, to completely understand the above fact, we must know exactly what machine language is. If you were to get a listing of a machine-language program, what you would have would be a long list of numbers. There would be no variable names, no labels of any kind, no strings of characters: nothing but numbers. Those numbers represent the instructions the machine understands and the data it needs to perform those instructions. And if we wanted to get very literal about all this, the numbers in our list would all be binary numbers—that is, consisting of nothing but zeros and ones. Usually, to make things easier for the programmer, "memory dumps" produce listings in hexadecimal format.

How a program is converted to machine code varies with the language you may be using. For example, when you run an uncompiled BASIC program, each statement in the program is converted into machine language as it's encountered, rather than the whole program being converted at once. This is why BASIC programs are so slow. BASIC is an example of an "interpreted" language.

Assembly-language programs are as close to machine language as you can get. Each assembly-language statement represents a single machine-language instruction. For this reason, many people confuse the terms "assembly language" and "machine language," but they are really not the same. Assembly language uses mnemonics (easy-to-remember names) for each of the machine-language instructions to make it easier for programmers to remember them. An assembly-language program is not interpreted; it is "assembled." During the assembly process, each of the mnemonics is converted to its machine-language equivalent.

Finally, we get to "compiled" languages, of which C is one. When a program is compiled, all the instructions in the source code are converted into machine language, so that we end up with a runnable program—one that doesn't need to be interpreted. That's why C programs run faster than BASIC programs. Of course, before we have a runnable module, we have to do some linking. We'll get to that in a moment.

Compilation

What exactly goes on during a compilation depends on the compiler you're using. There are really no set rules, except that it's the compiler's responsibility to take the source code and turn it into object code, the machine-language version of the program. To accomplish this, some compilers make several "passes" over the source code, while others, such as Megamax C, make only one pass.

The one-pass compiler is much faster than the others, but that speed comes with certain disadvantages. For instance, a multipass compiler usually converts the source code into assembly code, then assembles the assembly code into the object code. (The Alcyon compiler works this way.) One of the advantages of this multistep process is that the assembly code that is produced by the compiler can be modified by the programmer before it is assembled and linked. This way, the programmer can do some code optimizing on sections of the program that may not run as fast as he'd like. In addition, the assembly-language listings produced by the compiler can be helpful in locating hard-to-find bugs in the program (assuming that you are familiar with 68000 assembly language).

The Megamax compiler is a one-pass compiler. It takes our source code and converts it directly into a machine-language module. Because no assembly-language file is created during the compilation, we don't have the option of "tweaking" the program.

However, to make up for this, Megamax allows us to place assembly-language code directly into our source code, which speeds up sections of our programs that may need optimizing. In addition, you can use a disassembler to turn the object module into assembly code.

Another important thing we need to know about the compiler is that it can substitute machine-language instructions only for text within the source code that it recognizes as C keywords or C operations. Generally, the process goes something like this: The compiler grabs a line of source code and compares what it finds there to a list of instructions it's able to handle. If it finds a match, it writes to the object file it's creating the machine-language code that represents the C instruction it found. If it doesn't find a match, it sets aside the instruction and goes on to the next.

For example, let's say the compiler has just read in this line:

for (x=8; x<10; ++x)

This is a standard for loop (C's counterpart to BASIC's FOR...NEXT), and the compiler knows exactly what to do with it. The keyword for will be in its list of acceptable instructions, and the values to use in the loop are found within the source line itself. The only stumbling block is the variable x. If x has been defined properly, its address will be found in a table of addresses the compiler has built. If x isn't found in the table, the compiler will generate an error.

Now let's say the compiler reads in this line:

v_bar (handle, pxy);

The compiler can check for the variables handle and pxy to make sure that they're in its table. If they're found in the table, the compiler is satisfied. If they're not in the table, an error is generated. But what about the label v_bar? It's a function, not a keyword, so it won't be found in the compiler's list of instructions. The compiler has no idea of what to do with v_bar(), so it just assumes that it'll run across the label for this function somewhere else in the program. It leaves a space for its address and moves on.

If v_bar() happened to be one of our own functions, the compiler would come across it sooner or later and store its address in the space it reserved for that address. (This is called "back patching," and not all compilers do this. Sometimes patching in the address is left to the linker.) But, as you know, v_bar() is a VDI function. The function itself will not be found in our source code. Does this problem upset the compiler? The compiler couldn't care less about the absence of a function. It'll assume that the function we're calling will be found in another module, and pass the problem on to the linker.

Linking

It's important to realize that the code produced by the compiler, even though it's in machine-language form, is not executable. In that object module are many "references" that need to be resolved, such as v_bar() from the above example. Essentially, what the compiler has passed on to the linker is an object module containing all the machine code generated from our source code, but missing much of the machine-language code it needs to become executable.

When the compiler came across our call to v_bar(), for instance, it didn't know where the code for this mysterious function was; so it left a blank for the linker to handle. When we link the program, the linker will add the code needed to perform v_bar() and patch the address of that code into the blank space left by the compiler.

What is the address of v_bar()? Well, we don't really know. All (well, almost all) of the programs that run on an ST must be "relocatable"; that is, they must be able to run anywhere in your ST's memory. This causes a problem for the linker when it comes to addresses, because the addresses of functions and data will change depending on where the program is loaded in memory. I said the linker must supply the addresses, right? How can the linker supply an address for a relocatable program that has yet to be loaded in memory?

In a way, it can't. All the addresses generated during the compile and link process are actually offsets from the beginning of the program, and the beginning of the program is given the address of zero. When you load an executable program into your ST's memory, the program loader replaces these offsets with real addresses. Sounds tricky, but there is really nothing to it. All the loader has to do is add the offsets already generated during the compile and link to the address the program is being loaded at. This sum will be an absolute address. Simple, eh?

Although we don't know at link time the absolute address of v_bar() (or any other function), we do know where the code for calling this function on a machine-language level can be found: It's in Megamax's system library, SYSLIB. In fact, SYSLIB contains the code for calling all the GEM and TOS functions listed in your Megamax manual. (Other compilers have a similar system library, but under a different name.)

Notice I said above that SYSLIB contains the code for calling all the functions. The machine-language code that actually performs v_bar() and the other system functions is built into your ST's operating system; it's part of GEM. The code found in SYSLIB "binds" the code generated by the compiler to the OS routines. This binding is necessary because the ST's operating system requires a lot of special handling. For instance, a VDI call needs to have some arrays filled in before it can do its work. When programming in C, these arrays are invisible to us. But if we were programming in assembly language, we'd have to handle these arrays ourselves.

So the linker takes the code that was generated by the compiler and attempts to resolve all the missing addresses. In its attempt to do this, the linker will search through any other files you may be linking to, as well as its own system files. When the linker finds the proper label in its table, it adds the machine code for the function to our existing object module and patches in the address of the code. This continues, with the linker constantly adding code and resolving addresses, until it gets to the end of the object-code module, at which point we have a complete program.

The file types

Some people may be confused about all the different file types we encounter when putting together a program in C. There are three we need to be concerned with: .O files, .H files and libraries.

The .O files are the object files we've been talking about. They are in machine-code form, but are not as yet executable. They need to be combined by the linker with the code that will make them complete programs.

When developing a program in C, it is advantageous to compile finished portions of the program into separate .O modules. This technique greatly speeds up compile time as our program gets bigger and bigger, since the code we've written previously doesn't need to be compiled every time; it just has to be linked to our new code.

Let's write a simple program that will illustrate some of the things we've been talking about. First, type in the following code under the filename TEST.C and compile it:

main ()
{
    print_text ( "This is a test." );
    gemdos (0x1);    /* GEMDOS function 1: wait for a keypress */
}

After compilation you should have the file TEST.O on your disk. This file contains the machine-code equivalent of the C program shown above. The compiler has converted everything in the source code except the call to print_text(). The compiler can't do anything about this function because it doesn't know where or what it is. Did the compiler complain? Did you get an error? No. The compiler just assumed we knew what we were doing and left the missing-function problem for the linker to solve.

Now try to link TEST.O. What happened? After searching through all its libraries in vain, the linker told us that it didn't know anything about a function called print_text(). The linker passed the problem back to us. We have to solve the problem by writing the code for print_text(). Type the following under the filename PRINT.C and compile it:

print_text ( string )
char *string;
{
  printf ( "%s\n", string );
}

You should now have on your disk the files TEST.O and PRINT.O. All we have to do to get an executable program is link these two files together. Do that and run the resultant program. It works!

(The linker did more than put together our two object modules; it also added other necessary code, such as the printf() routines from the system libraries.)

Megamax's libraries (SYSLIB, DOUBLE.L and ACC.L) are really the same thing as .O files. They each contain the object code necessary to perform certain functions. We already talked about SYSLIB; you know what it is. The file DOUBLE.L is a machine-language module that, when linked into your program, replaces the regular floating-point math routines with more accurate ones, allowing you to get greater precision. The ACC.L file needs to be linked to your program whenever you're writing a desk accessory, since desk accessories have to be initialized differently than regular programs. (We talked about desk accessories in the October '88 C-manship.)

Finally, we have the .H files. There is really no mystery here. These "header" files are included with your compiler as a convenience. Because there are hundreds of standard names for various GEM parameters, as well as various standard structures that are used by GEM programmers, it would be silly to have to type all that stuff in every time you want to write a program. To save wear and tear on your keyboard, all the commonly used data structures and names are provided for you. All you have to do is "include" them into your code.

You can do the same sort of thing when writing your own programs. To keep down the size of each module of your program, you can take all the #defines and global data declarations normally found at the top of your program and place them into a separate file. Traditionally, this type of file is given the .H extension. Let's say your main source-code file is called MYPROG.C. You would then place the data mentioned above into a header file called MYPROG.H. Then, at the top of your program, you would have the line #include "MYPROG.H" so the compiler would know where the code belongs.

Take a look at the .H files that came with your compiler, and you'll see that they are really nothing more than a collection of #defines and data declarations.

Moving along

For some of you, this excursion into the world of compilers and linkers was a rehash of information you were already familiar with. If there was nothing here for you, I apologize. But I know that there are many of you who have been taking the compilation process for granted, and many of you may have run into problems that you couldn't understand because you didn't know what was going on with your compiler. I hope this discussion cleared away some of the clouds. I'm sure you gained some appreciation of what marvelous feats of programming compilers and linkers are.