Assembly Line: File Handling

Assembly Line

File handling, Part 1

by Douglas Weir

Douglas Weir, who was once a technical editor for ST-LOG, is now employed as a technical writer for Wang Computers in Boston. Besides programming, he enjoys classical music and good books.

File operations are the lifeblood of "real" computing. Files are where a program's input often comes from, and the precious output of most programs is usually saved in files. Not to mention the fact that the program itself exists as a disk file before it is loaded by the operating system and run (unless perhaps it's written in an interpreted language like BASIC).

So far we haven't done anything with files, but that will change with this month's episode. After we're finished with the miniseries, we'll know how to do all the usual file operations in assembly language: opening, closing, reading and writing files, as well as using disk directories.

GEMDOS treats disk I/O, as far as it can, as just another kind of "plain" I/O. For example, the function we'll be using to read from a disk file can also be used to read from the keyboard. As a result, the basic disk operations are pretty simple from the programmer's point of view. Programming for disk input can even be easier than keyboard I/O, since there are some things that can happen at the keyboard in an "interactive" environment that can be ignored when reading from a disk.

The one great danger with disk I/O is that, if you do something wrong, you can accidentally erase or overwrite valuable files, including the source code for the program you're working on! So it's sensible to take some extra precautions. If you have two disk drives, test your I/O on a disk in the drive that doesn't contain your program disk. If you have only one drive, make sure the program disk contains only the program and the necessary data files, and keep a constantly updated backup of the program file(s) on a separate disk. Finally, if you have a hard disk, I strongly recommend that you do your I/O testing on a disk in the floppy drive, not on the hard disk. It goes without saying that, within your program, you should be extra careful to make sure that the I/O drive is (and remains) the one you want it to be, until you're sure all the basic things are working the way they should.

In any case, we shouldn't be able to get into too much trouble as long as we're not writing data to the disk. For that reason, we'll be only opening, reading and closing a disk file this time. Next time we'll begin writing data.

This month's program will, when run, open a file (any file), read it, display its contents on the screen in a special format, close the file and terminate. We'll learn a couple of new 68000 instructions as well as a handy feature of GEMDOS. First, the handy feature.

The Command Line

In previous programs we've used keyboard input to get necessary information from the user about various tasks we want to perform. There's nothing wrong with this, except that it can be a bit tedious to write a keyboard input routine just to get, say, a single vital piece of data (and no more) from the user. This month we find ourselves in just such a predicament. We need to know the name of the file the user wants to display, but that's all we need to know. Isn't there an easier way to pass simple parameters from the user of a program to the program itself?

As you've probably already guessed, there is. Instead of asking for a filename after the program has begun, we'll just require the user to type the filename after the name of the program (with at least one space between the two), on the same line, before hitting Return. For example:

dump a:glob.txt

Type that, hit Return, and dump will read and display the file glob.txt on drive A, if such a file exists on that drive. How does the program find out what was typed after its name when it was called? Read on.

Whenever GEMDOS loads a program into memory, it sets up a special area for that program called the "base page." The base page, according to an old piece of Digital Research documentation that I'm looking at, "is a 256-byte data structure that defines a program's operating environment." In other words, it contains useful facts about the program such as, for example, how long its data area is, how long its program area is and so on. The second half of the base page contains what we're interested in right now: all the characters typed by the user after the program name and before Return, up to a total of 126 bytes. The entire string typed by the user, including the program name, is called the "command line"; anything typed after the program name (except for the Return) is called the "command tail."

A program running under GEMDOS is a lot like a subroutine. It was "called" by GEMDOS and, when it terminates, execution will return to GEMDOS. Its stack configuration is also similar to that of a subroutine. When it begins, the stack pointer is pointing to a return address within GEMDOS's command processor. Just above this return address is the address of the program's base page.

So the first thing our program does is read the base page address from the stack, just as any subroutine would read parameters that had been passed to it on the stack. The command tail string begins 128 bytes from the beginning of the base page, so we load that address into register a2. The command tail has a format somewhat like a BASIC string: Its first byte contains, not a character, but rather a number indicating the length of the string itself, which begins at the next byte. So we load this count into a data register, at the same time incrementing a2 to point to the first character. Now we're ready to begin.
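Here's a rough sketch of those first few instructions. The label start and the choice of d0 for the count are my own assumptions for illustration, and the syntax (";" for comments) assumes a Devpac-style assembler; adjust to suit yours.

```asm
; Sketch: locate the command tail at program entry.
; Assumes sp still has the value it had when the program began.
start:
        move.l  4(sp),a0        ; a0 = base page address (just above the return address)
        lea     128(a0),a2      ; a2 -> command tail, 128 bytes into the base page
        moveq   #0,d0           ; clear all of d0 before loading a single byte
        move.b  (a2)+,d0        ; d0 = length byte; a2 now -> first character
```

Clearing d0 first matters because move.b changes only the low byte; without it, whatever was in the upper 24 bits would still be there.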

Open, Says Me

The command tail (which we are assuming contains only a valid filename) and its length are passed to the subroutine open__file. The first order of business now is to copy the string to a location within our own data area (labelled filename in the data segment). If we decide to fool around with the string's contents, we don't want to do this in the base page area, where all sorts of undefined havoc might occur if things went wrong.

There's one bit of fooling around you should probably always do: append a null to the end of the string. Depending on whose documentation you read, GEMDOS does or does not terminate the base-page copy with a null. I've never bothered to find out which is true (besides, things might change in a later version); I append it myself and that way I know it's there.
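One way the copy-and-terminate step could be written is sketched below. It assumes a2 points to the first character of the tail and d0 holds its length; the labels copy_loop and copy_test are hypothetical, and the section directive follows Devpac-style syntax.

```asm
; Sketch: copy d0.w bytes from (a2) to filename, then append a null.
; On entry: a2 -> first character of the tail, d0.w = length (0-127).
        lea     filename,a1     ; a1 -> our own copy of the string
        bra.s   copy_test       ; enter at the test so a zero length copies nothing
copy_loop:
        move.b  (a2)+,(a1)+     ; copy one character
copy_test:
        dbra    d0,copy_loop    ; count down; falls through when d0.w reaches -1
        clr.b   (a1)            ; terminate the copy with a null

        section bss
filename:
        ds.b    128             ; room for the longest possible tail plus the null
```

Entering the loop at the dbra means exactly d0 characters are copied, and an empty command tail leaves just the null in filename.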

By the way, it would seem to be a good idea, when copying the command tail, to skip over any leading spaces in the string. However, it doesn't seem to make any difference to GEMDOS whether there are spaces in front of the filename string or not, so I skipped this step.

Now all we have to do is open the file. We could do some more error-checking on the filename string, but it isn't really necessary. If the filename for some reason isn't valid, then GEMDOS won't open the file, because it won't exist, and we'll find that out as soon as we try to open it. So let's.

The 3D Function

Of course, all the GEMDOS functions are, in a way, multi-dimensional. In this case, though, "3D" refers to the hex code for the GEMDOS file-open function.

The function takes three parameters, passed on the stack, as usual. First comes a number (word-size) from 0 to 2. This code tells GEMDOS whether you want to open the file to read only (0), write only (1), or to read and write (2). We pass a 0 to indicate that we want only to read from the file.

Next comes the filename string itself, which must be terminated by a null. We pass the address of our null-terminated copy here. Last comes the function code itself, $3d (61 decimal, if you like things that way).

GEMDOS will now try to open the file. If it's successful, a "file handle" will be returned in register d0; otherwise, an error code will be returned. The file handle is simply a number that GEMDOS uses to identify the file once it's been opened. Handles can be used for other I/O devices too: 0 and 1 refer to the keyboard and screen respectively (so, apparently, do 4 and 5), 2 refers to the RS-232 port, and 3 to the printer port.

Assuming that there is a formatted disk in the drive you want to access, an error return can only mean that the filename is invalid: Either there's something syntactically wrong with the name as originally typed by the user, or the name is correct but the file doesn't exist.

GEMDOS error codes are always negative numbers, so all we have to do is test d0 and abort if it contains a negative value. Otherwise, we save the returned file handle (in an area labelled handle), and return to our caller. Note that our subroutine open__file returns to its caller in d0 the same code (or handle) that was returned to it by the GEMDOS function. In order to do this, we must remember to not save d0 at the beginning of the subroutine, even though we use it, or "restore" it at the end.
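Put together, the open call might look something like this. It's a sketch, not the listing itself: the labels open_sketch and open_done are made up for the example, the filename is fixed in the data segment just to keep the fragment self-contained, and the section directives assume a Devpac-style assembler.

```asm
; Sketch: open the file named at filename for reading.
; Returns the handle (or a negative error code) in d0.
open_sketch:
        move.w  #0,-(sp)        ; mode: 0 = read only
        move.l  #filename,-(sp) ; address of the null-terminated name
        move.w  #$3d,-(sp)      ; GEMDOS function code: open file
        trap    #1              ; call GEMDOS
        addq.l  #8,sp           ; remove 2 + 4 + 2 bytes of parameters
        tst.l   d0              ; negative means an error code
        bmi.s   open_done       ; on error, just pass the code back in d0
        move.w  d0,handle       ; otherwise save the file handle
open_done:
        rts

        section data
filename:
        dc.b    'a:glob.txt',0  ; example name, null-terminated
        even

        section bss
handle:
        ds.w    1               ; the saved file handle
```

Notice that d0 is neither saved nor restored, so the caller sees exactly what GEMDOS returned.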

Meanwhile, Back At the Branch...

As usual, after we return to the instruction immediately after the "branch to subroutine" that called the subroutine, we adjust the stack to compensate for the extra parameters we pushed. Now we test d0 to see if open__file did in fact open a file. If d0 contains a positive number, then this must be a file handle and we can safely proceed. Otherwise, no file was opened and we must terminate the program.

Assuming everything went well, we now branch to the subroutine display__file, which does the real work of this program.

Translating Codes

"Dump" is a very simple program. It reads from its input file one byte at a time, and writes the value of each byte (one per line!) to the screen. Only ASCII codes (those between 32 and 127 inclusive) can be printed "raw," so some translation has to be done to the other values.

The values 0 through 31 are often called "control codes." Some of them are very familiar (such as 0, null, or 13, carriage return); others are quite obscure (my personal favorite is 21, "negative acknowledge"). Control codes were first used, as I understand it, on machines such as teletypes. They were adequate for managing the rudimentary formatting capabilities available at that time, as well as communicating "overhead" information about data transfers between machines (for example, end-of-text, start-of-message and so on). Nowadays host-terminal communications are much more complicated, and elaborate systems such as the ANSI escape sequences have been evolved to handle them—note, however, that "Escape" itself is a control code (27).

Most of the control codes are little used, but some of them are used all the time. Our program, whenever it reads a control code, translates it into its standard two- or three-character abbreviation, which it then prints to the screen. The strings containing these abbreviations are found in the Ctrl table in the data segment. There you can also find, if you're interested, what all those puzzling abbreviations actually mean.

Of course, like beauty, control codes exist only in the eyes of the beholder. A programmer can choose to use these values for something else and, as long as he or she is consistent about it (and the I/O routines don't interfere), there will be no problem. Or, consider the values you'd expect to find in a program file (i.e., a file containing "runnable" machine code). Values from 0 through 31 will be treated by our program as control codes, which they almost certainly won't be: they'll simply be machine instructions (or byte-sized parts of machine instructions) that happen to have values within this range. In general, control codes occur with regularity only in text files.

The values 128 through 255 can be used for all sorts of things, depending on the computer. Often "extensions" to a computer's ASCII set are implemented here: such things as "graphics characters," or math symbols, and so on. Our program simply translates all such values into two-digit hex numbers and prints them to the screen.
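The three-way split described above (control codes, printable characters, everything else) can be sketched with two unsigned comparisons. The routine name classify and the result codes in d0 are assumptions made for this example; the program itself may well branch directly to its translation code instead.

```asm
; Sketch: classify the byte in d3.
; Returns d0 = 0 (control code), 1 (print raw) or 2 (show as hex).
classify:
        moveq   #0,d0           ; assume a control code
        cmp.b   #32,d3
        bcs.s   class_done      ; d3 < 32 (unsigned): control code
        moveq   #1,d0           ; assume a printable character
        cmp.b   #127,d3
        bls.s   class_done      ; d3 <= 127 (unsigned): print it raw
        moveq   #2,d0           ; 128-255: translate to hex
class_done:
        rts
```

The unsigned branches (bcs, bls) are the important detail: with a signed branch, values from 128 through 255 would look negative and be mistaken for control codes.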

Reading a File

Now let's see how it's done. Most of display__file consists of a large loop (its top is at d__go). At each iteration of the loop one byte is read from the file selected, translated (if necessary), and displayed. This continues until the end of the file is reached. The user can stop the output at any time by pressing the keys Control-S (control codes again! "S" is the 19th letter of the alphabet, so this is actually code 19: "DC3"); press Control-Q and the display will resume. Pressing Control-C will abort the program.

GEMDOS function $3F is used to read from a file. It takes four parameters. First comes the address of a memory area (within your program) into which the data read is to be copied. This area should be large enough to hold the largest amount of data you plan to read at one time; otherwise, data after this location could be overwritten by data read from the file.

The second parameter is a byte count. Although we're only reading one byte at a time here, you can use the function to read much larger amounts of data on one call. Actually, our method isn't quite as inefficient as it looks: GEMDOS has an internal buffer which it always tries to fill on a read operation. Subsequent calls to the function simply return data from the buffer until it is exhausted, when another read is performed, and so on. The GEMDOS buffer seems to be about 512 bytes in size.

The third parameter (word-sized) is the file handle; the fourth (also word-sized) is the function code, $3F (63 decimal).

The GEMDOS Read function returns a value in register d0. If the value is negative, then some sort of error has occurred (we shouldn't have to worry about that right now). Otherwise, d0 contains the number of bytes read. This should be zero (0) when end-of-file is reached, and that does seem to be the case when reading single bytes. However, the Abacus book Atari ST Internals warns that Read never detects end-of-file, and that the programmer should get the file's size from the directory in order to find out how far to read. Single-byte reads do seem to pick up end-of-file, so we'll keep our fingers crossed until next time, when directories are one of the things we'll learn more about.
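A single-byte read along these lines might be coded as follows. This is only a sketch: read_byte and buffer are hypothetical names, handle is assumed to hold the value returned by the open call, and the section directive assumes a Devpac-style assembler. The buffer address and the byte count are both longwords.

```asm
; Sketch: read one byte from the open file into buffer.
; Returns with d0 tested: negative = error, zero = end of file.
read_byte:
        move.l  #buffer,-(sp)   ; where the data should be copied
        move.l  #1,-(sp)        ; byte count (a longword)
        move.w  handle,-(sp)    ; file handle from the open call
        move.w  #$3f,-(sp)      ; GEMDOS function code: read
        trap    #1              ; call GEMDOS
        lea     12(sp),sp       ; remove 4 + 4 + 2 + 2 bytes of parameters
        tst.l   d0              ; set the flags for the caller
        rts

        section bss
handle:
        ds.w    1               ; filled in when the file is opened
buffer:
        ds.b    1               ; one byte is all we read at a time
```

A caller could follow bsr read_byte with bmi to an error routine and beq to its end-of-file code; if neither branch is taken, the byte is waiting in buffer.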

I'll save the rest of the detailed explanation until next time, but I would like to mention the two new instructions used this time, as well as explain a new use for a third.

The rol (Rotate Left) instruction is something of a variation on the logical shift instruction we learned a couple of installments ago. The shift instructions, you'll remember, do just that: shift the bit-values in a data register the indicated number of bits left or right. Values shifted out of a register are lost. The rotate instructions are a bit different. The value in the specified register is shifted right or left as before; but the bit-values shifted out one end of a register are inserted back into the register at its other end—nothing is lost. You can specify byte-, word- or longword-size for a rotate. Suppose the low byte of register d0 contained the following binary value:

10000001

After the instruction rol.b #1, d0 is executed, d0's low byte will contain the following value:

00000011

The leftmost 1 was rotated leftward out of the byte, and back into the right end of the byte. The number of bits to rotate is indicated the same as with the shift instructions: an immediate value can be used to specify up to eight rotates at one time, otherwise a data register (in the source operand field) contains a number indicating the number of rotates to perform. In our program, register d3 (its entire 32 bits) is rotated left four bits.
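A few rotates with worked values may make this concrete. The starting values are arbitrary ones chosen for illustration.

```asm
; Sketch: rotate examples with their results.
        move.b  #%10000001,d0
        rol.b   #1,d0           ; low byte of d0 = %00000011

        move.l  #$1234567f,d3
        rol.l   #4,d3           ; d3 = $234567f1: the top nybble wraps to the bottom

        moveq   #12,d1          ; rotate count held in a data register
        rol.l   d1,d3           ; d3 = $567f1234: three more nybbles wrap around
```

Since four bits make one hex digit, rotating a longword by four moves every hex digit one place to the left, with the leftmost digit reappearing on the right.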

The shifts and rotates between the labels do__hex and do__ctrl are used to get at the two low half-bytes (or "nybbles") in d3 separately in order to translate each into a hex digit.
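To show the idea of pulling a byte apart into nybbles, here is a sketch that uses a shift and a mask rather than the rotates in the listing, so it is a different technique from the one the program uses. The labels byte_to_hex, digits and hexbuf are hypothetical, and the section directives assume a Devpac-style assembler.

```asm
; Sketch: convert the low byte of d3 into two hex digits at hexbuf.
byte_to_hex:
        lea     digits,a1       ; a1 -> table of digit characters
        lea     hexbuf,a0       ; a0 -> output string
        moveq   #0,d1           ; clear d1 so it can serve as a clean index
        move.b  d3,d1           ; copy the byte
        lsr.b   #4,d1           ; high nybble moves down to bits 0-3
        move.b  0(a1,d1.w),(a0)+ ; first hex digit
        move.b  d3,d1           ; copy the byte again
        and.b   #$0f,d1         ; keep only the low nybble
        move.b  0(a1,d1.w),(a0)+ ; second hex digit
        clr.b   (a0)            ; null-terminate the result
        rts

        section data
digits:
        dc.b    '0123456789ABCDEF'

        section bss
hexbuf:
        ds.b    3               ; two digits plus the null
```

With $7f in the low byte of d3, for example, hexbuf ends up holding the characters 7 and F followed by a null.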

The instruction pea (Push Effective Address) works just the same as lea (Load Effective Address), which we learned about last time; the only difference is that, after generating the address value specified, pea pushes the result on the stack instead of loading it into a register.
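A side-by-side comparison, using a hypothetical label buffer and Devpac-style section syntax, might look like this:

```asm
; Sketch: two ways to push the address of buffer on the stack.
        lea     buffer,a0       ; a0 = address of buffer
        move.l  a0,-(sp)        ; push it: two instructions and a register

        pea     buffer          ; the same push in one instruction, no register needed

        addq.l  #8,sp           ; discard both longwords again

        section bss
buffer:
        ds.b    2
```

This makes pea a natural fit for passing addresses as parameters to GEMDOS functions.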

Finally, note how the lsl instruction is used under do__ctrl to multiply the contents of a register by four. Just as, in decimal numbers, moving a digit one place to the left is equivalent to multiplying it by ten, moving a binary digit one place to the left is the same as multiplying it by two. Do this twice and you've multiplied by four. This little trick can be used whenever you want to multiply a positive integer by a power of two, and it can be much faster than using a multiply instruction such as mulu.
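Multiplying by four is just what's needed to index a table whose entries are four bytes long. The sketch below assumes such a layout (a three-character abbreviation plus a null); the label ctrl_tab and the shortened four-entry table are made up for the example and aren't taken from the listing.

```asm
; Sketch: use lsl to turn a control code into a table offset.
        moveq   #3,d3           ; control code 3
        lsl.w   #2,d3           ; shift left twice: 3 * 4 = 12
        lea     ctrl_tab,a0     ; a0 -> start of the table
        adda.w  d3,a0           ; a0 -> entry 3, the string 'ETX'

        section data
ctrl_tab:
        dc.b    'NUL',0         ; code 0
        dc.b    'SOH',0         ; code 1
        dc.b    'STX',0         ; code 2
        dc.b    'ETX',0         ; code 3
```

Keeping every entry the same power-of-two length is what makes the shift trick work; entries of three or five bytes would need a real multiply or a table of addresses.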

That's all for this time. Type in the program, assemble and run it. Next time I'll explain the rest of display__file in more detail, and we'll learn some more file operations. Until then, you might think about making some obvious improvements to the program as it now stands. For example, how would you go about printing, say, four bytes (rather than one) to a line?