Assembly line: Inside Subroutines

by Douglas Weir

I mentioned last time that there's a way of looking at values on the stack without disturbing the stack itself. Doing this requires learning a new addressing mode, but it will be the next-to-last of all of them. This new mode, combined with a couple of other powerful instructions, will make it easy for us to write well-behaved, modular subroutines. We're now getting to a point where some of the things we do in assembly language can give us insight into how and why certain high-level languages—I'm thinking of C in particular—do some of the things they do.

This month's program, after you've assembled and linked it, will display four strings and quit. It's not a very exciting performance, but the simplicity of what's going on will let us concentrate on what's really important: the inner workings of a subroutine.

First, though, I'd like to tie up my discussion of loops from last time by introducing the 68000 instruction that is most often used in looping situations: dbra, or "decrement and branch."

A decremental journey.

In our last program, we wrote a loop used to scan through a number of keypress possibilities. The number of keys we were willing to consider was loaded into a register. Each time we went through the loop, we subtracted one (by means of a subq instruction) from the contents of this register. Then we tested the result of this subtraction with a bne instruction. If the count was down to zero, that meant the loop was finished; otherwise, we branched back to the top of the loop and went through the whole process again, and so on.

That's fine, but there's a quicker way of doing the same thing. The 68000 has a set of instructions specially designed to handle everything required to manage a loop. The instructions execute in two parts. Each begins by performing a test (i.e., looking at the state of one of the bits in the Condition Codes register). If the result of the test is "true," execution of the instruction ends and the 68000 proceeds to the next instruction. However, if the test result is "false," the second "half" of the instruction is executed. One (1) is subtracted from a specified data register (the first of two operands for this instruction), and now this result is tested.

If the contents of the register have been exhausted, the instruction ends. If the contents of the register have not been exhausted, the 68000 branches to a label (which is the second operand specified for this instruction).

The most-used of these instructions is, as I've mentioned, the dbra. Actually, its proper name is dbf: "if ‘false’ is not ‘true,’ decrement and branch." And what does that mean? Let's look at a simple example.

        move.w  #10,d0   load loop count
loop:   nop              body of loop
        dbf     d0,loop  keep going...
        bra     out      till finished

We start by loading d0 with the immediate value 10. (Note, by the way, that we're loading it as a word value.) The next instruction, which is labelled loop, does nothing: it's our old friend nop. Nevertheless, this is the body of our loop. The next instruction contains all the loop control code we need. First, it checks the specified condition. What's the condition? That's easy—it's false (that's what the f stands for).

So the 68000 goes through a chain of reasoning something like this: "Let's check the condition. Is false true? No, it's not. So I can't drop out of this instruction yet. Now I'll subtract one from the register d0. Ten minus one equals nine, so the loop counter isn't finished yet either. That means I'll have to take the branch. . . Let's see, I'm supposed to branch to loop. Here goes. What's this—a nop? Okay, you want nothing, you got nothing. Now here's a dbf. Let's check the condition. Is false true?..." And so on. As you can see, there's a philosophical side to assembly language.

The main point of dbf is that the initial condition always tests out false—false can never be true, after all. So the decrement is always performed, and the loop always continues until the loop counter register is exhausted (unless, of course, you insert some other, unrelated code in the middle of the loop that causes a branch). The dbf instruction, therefore, gives you something very much like a FOR loop in BASIC:

for i = 10 to -1 step -1
print "howdy"
next i

or Pascal:

for i := 10 downto -1 do
    writeln('howdy');

or C:

for (i = 10; i > -1; i--)
    printf("howdy\n");

With dbf, it's almost as if the first test isn't there: only the "decrement and branch" part of the instruction is in operation. For this reason, most 68000 assemblers accept the alternate form dbra ("decrement and branch") in place of the strictly correct dbf ("until false is true, decrement and branch").

This may seem like a lot of theoretical rigamarole to go through just for an automatic loop counter instruction. But consider one of dbra's cousins, dbeq:

         move.b   #key,d0     char we're looking for
         move.w   #10,d1      loop count
         movea.l  #table,a0   base of table
loop:    cmp.b    (a0)+,d0    check char in table
         dbeq     d1,loop     if not found, continue loop
         nop                  do nothing

The first instruction loads a byte value into d0. Let's assume it's an ASCII code we'll be searching for. (We'll also assume that key is defined with an equ directive elsewhere in the program.) The next instruction loads the loop counter with the immediate value 10. Then the base address of a table of byte values (also declared elsewhere) is loaded into register a0, and we're ready to start the loop.

This time the body of the loop isn't a nop. Instead, we test whether the byte value whose address is currently in a0 is equal to the byte value in d0, and the postincrement then steps a0 on to the next byte in the table. This is done with a cmp instruction (see issue 16's Assembly line for a detailed explanation of cmp). If the two values are equal, then the zero bit in the Condition Codes register will be set to 1 when the instruction ends.

It's precisely this bit that dbeq tests first (eq in opcodes usually stands for an "equal" condition—in other words, is the zero bit set?—meaning that two values were just compared and were found equal). The 68000 reasons it out like this: "Hmm, a dbeq. Let's check the condition. Is the zero bit set? No, it's not (that means the last two values compared were not equal). Okay, so I'll do the decrement. Ten minus one equals nine. There's still plenty left in the register. So I'll do the branch. Right—back to loop. Get the byte a0 is pointing to. Okay, got it. (Don't forget to increment a0.) Now compare it to the byte in d0. Next instruction: here's a dbeq. Let's check the condition. Is the zero bit set? Yes, it is (that means the two values were equal). So that's that—I can forget about the rest of this instruction. Let's go to the next. What's this—a nop?. . ." Well, you get the idea.

Thus, an instruction like dbeq gives you a built-in compound condition for continuing the loop: keep looping until a match is found, or the loop counter is exhausted:

for (i = 10, match = FALSE; !match && i > -1; i--)
    match = table[i] == KEY;
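
If it helps to see the two ways out of the loop spelled out, here is a minimal C sketch of what the cmp.b/dbeq pair does. It's a model rather than a translation: the function name dbeq_search and its parameters are my own inventions for illustration, the zero bit is just an int, and the counter is tested against -1 (see "The minus syndrome" below for why).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the cmp.b/dbeq loop; not part of the original listing.
 * table plays the role of a0, count the role of d1, key the role of d0.
 * Returns a pointer to the matching byte, or NULL if the counter ran out. */
static const uint8_t *dbeq_search(const uint8_t *table, uint16_t count, uint8_t key)
{
    const uint8_t *p = table;

    for (;;) {
        int z = (*p++ == key);          /* cmp.b (a0)+,d0 sets Z on a match   */

        if (z)                          /* dbeq: condition true, fall through */
            return p - 1;               /* p has already stepped past the hit */

        count = (uint16_t)(count - 1);  /* otherwise decrement the low word   */
        if (count == 0xFFFF)            /* counter reached -1: fall through   */
            return NULL;
        /* counter not yet -1: branch back to the top of the loop */
    }
}
```

Called with a count of 10, the function looks at up to 11 bytes, so the table must be at least that long. In the assembly version both exits arrive at the same next instruction; because the dbcc instructions leave the condition codes alone, a beq or bne placed right after the dbeq can still tell a match from an exhausted counter.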

You will find that dbra is by far the most-used member of this group of instructions (they can be found under the general heading dbcc in the Motorola manual). However, you should not forget the extra efficiency offered by the others.

The minus syndrome.

Now let's go back and tidy up a couple of details. No doubt you've noticed that in the high-level language versions given, the loop counts down not to 0, but to -1. In fact, that's exactly how the dbcc instructions work. They test for -1, not 0, in the specified counter register. In other words, loops controlled by these instructions will always execute one iteration more than the value initially loaded into the register. Our little test loop above will execute 11, not 10, times as written. In order to get around this quirk, you must either use loop count values that are always one less than they should be (e.g., 9 if you want the loop to execute 10 times, etc.), or do something like the following:

          move.w  #10,d0     load loop count
          bra     test       now start
loop:     nop                do nothing
test:     dbra    d0,loop    keep going...
          bra     out        till finished

This is the same code as before, except the dbra instruction is labelled and a branch to this label is inserted to kick off the loop. The result is that the "extra" decrement-and-branch occurs before the loop really starts, and the body of the loop is only executed 10 times after all.
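
To check the arithmetic, here is a small C sketch that counts how many times the body runs for each way of entering the loop. The function names are mine, and the model assumes only what was said above: dbra subtracts one from the low word of the counter and branches unless the result is -1.

```c
#include <stdint.h>

/* Model of one dbra: decrement the word, report whether the branch is taken.
 * (The condition is "false", so the test half never ends the instruction.) */
static int dbra(uint16_t *counter)
{
    *counter = (uint16_t)(*counter - 1);
    return *counter != 0xFFFF;          /* branch unless the word is now -1 */
}

/* Loop entered at the top: the body runs before the first dbra. */
static unsigned run_from_top(uint16_t count)
{
    unsigned iterations = 0;

    do {
        iterations++;                   /* body of loop (the nop) */
    } while (dbra(&count));
    return iterations;
}

/* Loop entered at the dbra ("bra test"): the first dbra runs before the body. */
static unsigned run_from_test(uint16_t count)
{
    unsigned iterations = 0;

    while (dbra(&count))
        iterations++;                   /* body of loop (the nop) */
    return iterations;
}
```

With a count of 10, run_from_top gives 11 and run_from_test gives 10. The second form also copes with a count of zero, in which case the body is skipped entirely, much like a while loop in C.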

This is a favorite trick of mine, but there's a drawback: you waste a little time when you do things this way. One unnecessary bra and dbra are performed for every loop (not every iteration)! There are two schools of thought: assembly language is so fast that you usually can't tell the difference in these situations, but then many people go to the trouble of writing assembly language just to get away from this sort of redundancy (which is what compilers are past masters at). And, there are time-critical applications where the extra code does make a difference.

I'm in favor of a little redundancy in the interest of clarity and consistency. By doing loops this way—usually—I can use loop count constants which say what they mean. Whichever method you choose, you should try to be consistent, yet never forget that there's always another way of getting the job done.

It's important to note that a word value is loaded into the loop count register. That's because the dbcc instructions decrement and check the register as a word, not as a longword or a byte. (This means, by the way, that a loop can only execute a maximum of 65536 iterations using these instructions.) If you load the count as a byte, whatever happens to be in the rest of the low word becomes part of the count; if you load it as a longword, anything above the low word is ignored. Once you've written a mistake like this into a program, it can be very difficult to find.
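
Here is a rough C illustration of how a wrong-sized load goes astray. The helper names are hypothetical, and the starting register value 0x12345678 is just made-up leftover data; the only things assumed about the 68000 are that a byte or word move into a data register leaves the rest of the register alone, and that dbra looks at the low word only.

```c
#include <stdint.h>

/* move.b #value,dN: only the low byte of the register changes. */
static uint32_t move_b(uint32_t reg, uint8_t value)
{
    return (reg & 0xFFFFFF00u) | value;
}

/* move.w #value,dN: only the low word of the register changes. */
static uint32_t move_w(uint32_t reg, uint16_t value)
{
    return (reg & 0xFFFF0000u) | value;
}

/* How many times a loop entered at the top runs before dbra falls through. */
static uint32_t dbra_iterations(uint32_t reg)
{
    uint16_t low = (uint16_t)(reg & 0xFFFFu);   /* dbra sees only this word */

    return (uint32_t)low + 1u;
}
```

Loading 10 as a word into a register that held 0x12345678 gives the expected 11 passes. Loading it as a byte leaves 0x560A in the low word, and the loop runs 22027 times instead.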

Finally, notice what a powerful instruction dbeq (for example) is. Using it, we can write five lines of assembly code that are practically equivalent to two fairly dense lines of C code—not a bad ratio, considering the difference in language levels.

The plural of move is movem.

This month's other new instruction, movem, is not nearly so complicated. So far we've been using the stack to pass arguments to GEMDOS routines (via trap #1), and (implicitly) to save subroutine return addresses. We haven't saved any register values on the stack. This is easy to do, however, and is one of the stack's main uses. For example, we could write:

move.l   d0,-(a7)        save d0
move.l   a0,-(a7)        save a0
move.l   #message,-(a7)  get string address
move.w   #9,-(a7)        code=display string
trap     #1              do it
addq.l   #6,a7           pop arguments
movea.l  (a7)+,a0        restore a0
move.l   (a7)+,d0        restore d0

This bit of code would simply use GEMDOS function number 9 to display a string (labelled message), similarly to the sort of thing we've done more than once before. However, the current values of registers d0 and a0 are saved before the trap and restored afterward. If GEMDOS uses these two registers, it won't make any difference to us—we'll still have their current contents when the saved values are popped from the stack. We just have to make sure that we push and pop the values in exact opposite order.

Saving and restoring registers is such a common use for the stack that a compound form of this instruction has been included in the 68000 set: movem, for "move multiple." With one execution of movem, you can save one or all of the data and address registers (including a7) in any combination on the stack. With another execution, you can restore all these values. Using movem, the above code fragment would look like this:

movem.l   a0/d0,-(a7)      move 'em out
move.l    #message,-(a7)   get string address
move.w    #9,-(a7)         code=display string
trap      #1               do it
addq.l    #6,a7            pop args
movem.l   (a7)+,a0/d0      move 'em back in

Suppose you wanted to save more registers:

movem.l   a0-a2/a3-a4/a6/d0-d4,-(a7)   save 'em
(...do something here...)
movem.l  (a7)+,a6/a3-a4/d0-d4/a0-a2     get 'em

A sequence of registers to be saved is indicated by a dash: a0-a2 means a0, a1, a2. Groups of sequences are separated by slashes. A group can consist of a single register. (Note that you don't have to worry about the order in which you name registers.) The assembler, when generating the machine code for movem, generates a second word of code that consists entirely of bit-flags representing the registers. Each register has its own bit. If it's set, the register is pushed (or popped); otherwise, the register is left alone. The order of pushing and popping is always the same, and isn't dependent on the order in which you happen to write the registers.
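
The bit-flag idea is easy to model in C. What follows is only a sketch under some simplifying assumptions of mine: the stack holds longwords only, a7 is represented by a separate index (so the list covers d0 through a6), and one mask layout is used for both directions, whereas the real instruction reverses the bit order for the predecrement form. The names Machine, movem_push and movem_pop are invented for the example.

```c
#include <stdint.h>

enum { D0, D1, D2, D3, D4, D5, D6, D7, A0, A1, A2, A3, A4, A5, A6, NREGS };

#define REG(r)      (1u << (r))     /* one bit-flag per register */
#define STACK_LONGS 64

typedef struct {
    uint32_t reg[NREGS];
    uint32_t stack[STACK_LONGS];
    int      sp;                    /* index of the top item; grows downward */
} Machine;

static void machine_init(Machine *m)
{
    for (int r = 0; r < NREGS; r++)
        m->reg[r] = 0;
    m->sp = STACK_LONGS;            /* empty stack */
}

/* movem.l <list>,-(a7): registers go out from the highest down to d0. */
static void movem_push(Machine *m, uint16_t list)
{
    for (int r = NREGS - 1; r >= 0; r--)
        if (list & REG(r))
            m->stack[--m->sp] = m->reg[r];
}

/* movem.l (a7)+,<list>: registers come back from d0 upward. */
static void movem_pop(Machine *m, uint16_t list)
{
    for (int r = 0; r < NREGS; r++)
        if (list & REG(r))
            m->reg[r] = m->stack[m->sp++];
}
```

Because the list is nothing but a set of bits, REG(A0) | REG(D0) and REG(D0) | REG(A0) are the same value, which is why the order in which you write the registers can't matter. The two loops run in opposite directions so that whatever was pushed last (d0, if it's in the list) is the first thing popped.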

By the way, if you're using the Motorola 68000 manual as a reference, you'll probably get very confused if you look at its little diagram of the format of a movem instruction (at the bottom of page B-71 in the 5th edition). The diagram seems to be saying that the instruction is only one word long, and that the word somehow does double duty as both a machine instruction and a set of bit-flags for the registers. That would be a truly remarkable feat of data packing. I had to look up movem in Programming the 68000 by Steve Williams (Sybex) to make sure that the entire instruction (with the register list) is two, not one, words long. So, if you're interested in this sort of thing, be warned: the Motorola instruction format diagrams are not models of clarity.

A see-through address mode.

The last new topic I want to discuss this time is that promised address mode that will let us peek into a stack, without popping any items. Consider the situation immediately after we've jumped to a subroutine by executing a bsr instruction. What is the stack pointer pointing to? That's easy—the return address, a longword value. We know this because the return address was automatically pushed onto the stack by the bsr, and that was the last instruction executed.

Furthermore, we know that at any time after entering a subroutine in this manner, the execution of an rts instruction will immediately return us to where we came from (as long as we haven't fooled with the stack in the meantime).

Suppose we wanted to move the return address into another register, without disturbing the stack. We already know how to do that:

move.l    (a7),d0   get return address

Register a7 is pointing to the value, so we simply use the Address Register Indirect mode to read it via a7 and move a copy of it into d0. This mode (which I sneaked into issue 16's program) is the same thing as Address Register Indirect With Predecrement or Postincrement, only without the decrement or increment. The value of the register is used as an address, and the value remains unchanged.

Now suppose a word value had been pushed on the stack just before the bsr was executed. At entry into our subroutine, a7 points to a longword that's the return address. The word value pushed before the bsr is therefore in the two bytes just after the return address. If we added 4 to the current value of a7, it would then be pointing to our word. But we don't want to alter the value of a7.

When you think about it, writing:

move.l  (a7),d0  get return address

...is just another way of saying:

move.l   0(a7),d0   get return address

...that is, we're adding a zero offset to a7 to get an "effective address" that's used to read a value. Can we use offsets other than zero?

move.w   4(a7),d1   get the word that was pushed

...yes, we can. In this line of code, the 68000 adds the offset value 4 to the value of a7, and uses this new effective address to read the word value. The value of a7 remains the same; it's still pointing to the return address, ready to execute an rts at any time. This new addressing mode is called Address Register Indirect With Displacement (the latter word is Motorola's term for offset).

The offset value is a signed 16-bit value, which means that it must be greater than or equal to -32768, and less than or equal to 32767. (The Megamax C compiler allocates space for global variables and strings using this mode. It equates the variable names to offset constants, then uses the offsets with an address register to reference the variables. Strings get negative or positive offsets, and globals the other... I forget which. This explains the 32K-byte limit Megamax enforces on data segments.)

The term "effective address" is an important one. So far, we've been able to imagine that every address we've used in a program actually exists somewhere, either as a value in a register or as a constant generated by the assembler from a label. This time, however, the address we use to read the pushed word is... nowhere. The 68000 takes the value in an address register, adds a specified offset value to it, then uses this new temporary address to access memory in some way. The result of adding the offset to the register value is the effective address of either the source or the destination operand (here it was the source). After the instruction is executed, the effective address goes away, just like the result of a cmp instruction. We'll be using the concept of effective addresses often in coming installments.
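
For anyone who'd like to poke at this mode away from the ST, here is a minimal C sketch of a stack being peeked into. Everything in it is an assumption made for illustration: the Stack type, the 64-byte memory, and the function names are mine, and the only 68000 facts relied on are that the stack grows downward and that the high byte of a value sits at the lower address.

```c
#include <stdint.h>

#define MEM_SIZE 64

typedef struct {
    uint8_t  mem[MEM_SIZE];     /* a small slice of RAM holding the stack */
    uint32_t a7;                /* stack pointer: index of the top byte   */
} Stack;

static void push_word(Stack *s, uint16_t v)     /* move.w v,-(a7) */
{
    s->a7 -= 2;
    s->mem[s->a7]     = (uint8_t)(v >> 8);      /* high byte first */
    s->mem[s->a7 + 1] = (uint8_t)v;
}

static void push_long(Stack *s, uint32_t v)     /* move.l v,-(a7), or the push done by bsr */
{
    push_word(s, (uint16_t)v);                  /* low word lands at the higher address */
    push_word(s, (uint16_t)(v >> 16));
}

/* move.w d(a7),dN: read a word at a7 + d, leaving a7 untouched. */
static uint16_t peek_word(const Stack *s, int16_t d)
{
    uint32_t ea = s->a7 + (uint32_t)(int32_t)d; /* effective address, used once */

    return (uint16_t)((s->mem[ea] << 8) | s->mem[ea + 1]);
}

/* move.l d(a7),dN: the same thing for a longword. */
static uint32_t peek_long(const Stack *s, int16_t d)
{
    return ((uint32_t)peek_word(s, d) << 16) | peek_word(s, (int16_t)(d + 2));
}
```

Push a word, then a longword standing in for the return address, and peek_long(&s, 0) returns the "return address" while peek_word(&s, 4) returns the word beneath it. Notice that ea is a local variable that vanishes when the function returns, which is about as close as C gets to an address that exists nowhere.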

The program.

The program should now be very easy to understand. At the bottom is declared a table of four strings. Actually, it would be more meaningful to describe it as an array of strings. Above the strings are their addresses, arranged so that we can easily index through them with an address register.

The program begins by pushing the address of the table of string pointers on the stack, followed by the number of strings. Then we jump to the subroutine display.

The subroutine first pushes four registers (as longwords) onto the stack. That means the return address is now located 16 bytes beyond the current position of the stack pointer (a7), which was decremented by 16 to accommodate all the pushed data. The word item—count—that was pushed just before the bsr is located 4 bytes beyond the return address, making an offset total of 20 to be added to a7, if we are to read count. Again, strings, which was pushed before count, is 2 bytes beyond the latter (remember: count is only a word value), and an offset of 22 is therefore needed to access it.
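
The offsets can be double-checked by replaying the pushes on a bare number. This is only a sketch: the instruction comments are my reading of the description above rather than lines copied from the listing, and the names FrameOffsets and replay_pushes are made up for the purpose.

```c
#include <stdint.h>

/* Sizes in bytes of everything on the stack once display has saved its registers. */
enum {
    SAVED_REGS   = 4 * 4,   /* four registers pushed as longwords */
    RETURN_ADDR  = 4,       /* pushed by bsr                      */
    COUNT_SIZE   = 2,       /* count was pushed as a word         */
    STRINGS_SIZE = 4        /* address of the pointer table       */
};

/* Displacements from a7, lowest address first. */
enum {
    OFF_RETURN  = SAVED_REGS,                   /* 16 */
    OFF_COUNT   = OFF_RETURN + RETURN_ADDR,     /* 20 */
    OFF_STRINGS = OFF_COUNT + COUNT_SIZE        /* 22 */
};

typedef struct {
    uint32_t ret, count, strings;   /* distance of each item above the final a7 */
} FrameOffsets;

/* Replay the pushes, starting from any stack pointer value, and measure. */
static FrameOffsets replay_pushes(uint32_t a7)
{
    uint32_t at_strings, at_count, at_return;

    a7 -= STRINGS_SIZE; at_strings = a7;    /* push address of strings (assumed form) */
    a7 -= COUNT_SIZE;   at_count   = a7;    /* push count as a word (assumed form)    */
    a7 -= RETURN_ADDR;  at_return  = a7;    /* bsr display                            */
    a7 -= SAVED_REGS;                       /* movem.l of four registers              */

    FrameOffsets f = { at_return - a7, at_count - a7, at_strings - a7 };
    return f;
}
```

Whatever value a7 starts with, the three distances come out as 16, 20 and 22, matching the constants. If you ever change the number of registers saved at the top of the subroutine, every one of these offsets moves with it, which is worth remembering before hard-coding them.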

Registers a1 and d1 are loaded with strings (the base of the string address table) and count, respectively. We don't have to worry about data that might be lurking in the high bytes of d1: since the dbra instruction only uses the low word of the specified register, we can ignore the high word.

The loop is straightforward enough. The dbra is branched to at the start, thus ensuring that the loop will execute only four, not five, times. Note that registers a1 and d1 are used in the loop—not a0 and d0. That's because GEMDOS calls can affect the contents of a0 and d0: the routines use these registers, but (to save time) don't bother to preserve their old values.

When the loop ends, we restore the registers we saved at the beginning of the subroutine. After this is done, the return address is again the next item on the stack. Execution of rts pops it into the program counter, we return to the main program, and immediately execute GEMDOS call number 0: terminate program.

Next time we'll continue our discussion of subroutines with (among other things) one of the most complicated of all the 68000 instructions: link.