Assembly Line: Part 2: Multiple-precision multiplication and multiple-multiple-precision division

Assembly Line

Numbers part 2: Multiple-precision multiplication and multiple-multiple-precision division

by Douglas Weir

For example, suppose you have the binary number 1101. Now let's imagine that this number is held in two bytes of two bit each. The first (high) byte contains 11, and the second (low) byte contains 01. If we want to print this number as a string of hexadecimal digits, there's no problem we convert the first byte (separately) to 3 and the second byte (again, separately) to 1. But what if we want to convert the number to a decimal string? Binary 1101 equals decimal 13, and there's no way you're going to come out with this result by converting the bytes separately.

No matter how many separate location; binary 1101 is contained in (well, as long as it's not more than four, at any rate), there's no great difficulty in converting it to decimal ASCII on the 68000, since the value 13 is well within the limits of its divu (divide unsigned) instruction. But our 64-bit products are much too big for divu, and, to make matters more difficult, they're bigger even than a data register—twice as big, in fact. So our 64-bit division routine will have to be able to divide two numbers, each of which can be up to twice as long as the 68000's register length. When I first realized that I had to upgrade my 32-bit divide to 64 bits. I thanked heaven I had chosen the simple (i.e., stupid) way to do it the first time. The modifications weren't hard at all.

Well, well, well. We've successfully multiplied two 32-bit numbers. Now what do we do with the product? First of all, how do we print it out? It doesn't take long to realize that, in order to convert a 64-bit number to an ASCII string. we need more than the 32-bit division routine in our original program; we now need the ability to divide 64-bit numbers.

Of course, it's easy enough to print separately the high and low longwords that mul32 returns. But we can't just print these numbers next to each other to get the full double-longword result. We wan to represent our product as a decimal number, but it's held in memory in binary form, and there's no direct correspondence between the binary and decimal digits.

Subtract and conquer

Although there wasn't space to discuss it last time, you've no doubt noticed that div32 doesn't really do any dividing. Instead, one number is "divided" into another by simply subtracting it from the dividend until the dividend is either less than the divisor or equal to zero. At this point the number of subtractions is equivalent to the quotient from dividing the numbers, and what's left of the dividend is the remainder.

This is a perfect example of a "brute force" algorithm. There's an interesting characteristic of brute force approaches— they almost always work, and when they don't, you know why. You can't say that about "complicated" algorithms, and "real" long division in assembly language is definitely complicated.

Take a look at div64 in this month's listing (the entire program has been reprinted, since the 64-bit upgrade made other changes necessary too).

To subtract two double-longword values, you begin by subtracting the two low longwords and storing the result. There may be a carry generated as a result of the subtract; this is the equivalent of a "borrow." If, for example, you were to subtract decimal 17 from 24, you would begin by subtracting 7 from 4. Since 4 is less than 7, you borrow a 1 (actually a 10) from the next decimal place to the left, and make the 4 a 14. Now 14 minus 7 is 7; you write the 7 down and subtract the next column, and so on.

A borrow can occur in exactly the same way in a 68000 data register. When this happens, the Carry bit in the Condition Codes register is automatically set to 1. You can thus subtract numbers a hundred bytes (or words, or longwords) long, simply by subtracting byte by byte, checking the Carry bit after each subtract, and subtracting an extra 1 from the next byte whenever a borrow has occurred. Why the extra 1?

Because when a borrow "occurs" in the 68000 (or most other processors, for that matter), it doesn't really happen—the Carry bit lets you know that it should happen. After all, the 68000 has no way of knowing the location of the next-higher unit of the value you're subtracting from. It could be a register or a memory location. So, when this situation occurs, the current subtraction is handled as though a borrow was completed, but it's up to you—the programmer—to make sure that the borrowed digit is actually subtracted from the "next" value. Understanding the mechanics of subtracting a larger from a smaller binary value requires learning about signed binary numbers, and we'll leave that for another time.

Since we only want to subtract double longwords, all we have to do is subtract the low longwords, check for a carry, and (if the Carry bit is set) subtract an extra 1 when we subtract the high longwords. However, there's an even easier way to do this on the 68000.

Extending a helping hand

The designers of the 68000 wanted to make it as easy as possible for programmers to implement multi-precision arithmetic operations. There are times when it's not convenient to detect and react to a carry condition right away. However, we know that the condition codes are very volatile—the Carry bit is set or reset by most instructions. So instead of forcing the programmer to explicitly save the condition in such circumstances, the 68000 has a special bit in its Condition Codes register that behaves identically to the Carry bit, but only when certain instructions are used. The bit is called the Extend bit, and instructions that use or affect it always contain an x in their names. For example, subx: subtract with extend.

A subx has exactly the same effect as a plain sub—it subtracts the source from the destination operand, and puts the result in the latter. The difference is twofold: first, the Extend bit is set or cleared in exactly the same way as the Carry bit (which is still affected as before, by the way), and it won't be altered until another "extend" instruction is executed or the programmer explicitly changes the contents of the Condition Codes register. (We'll see how this works in a moment.) You can execute a subx and then execute any number of other instructions and be sure that the carry condition will be saved in the Extend bit until you need it.

The second difference in subx is that the carry from the previous subx will automatically be incorporated in the next subx. In other words, if a subtract operation generates a borrow, then the next subtract will Subtract both its source operand and the borrowed bit from its destination operand, if both operations are executed with subx. It seems that all of our work will be done for us.

There's one hitch with the extend instructions, though. You are always restricted to much fewer addressing modes when you use them. For example, subx will allow you only to subtract a data register from another data register, or a value in memory from another value in memory, both of which must be accessed with the predecrement addressing mode. This won't be a problem for us here: our operands are already in data registers.

Since the effect of any subx always depends partly on what the value of the Extend bit already is, we would like to be sure that this bit has a value of zero when we do the first subtract—otherwise a spurious borrow could be subtracted the first time around. There is a way to write values into the Condition Codes register, but it's a rather roundabout way.

Logical operations

Yes, I know—all my operations are logical. But what I have in mind here is a special kind of logic whose rules have a lot to do with the way computer hardware is implemented.

Suppose we want to know whether a statement (we'll call it X) is true or false. Suppose further that we know that, if statement A is true or if statement B is true, then X is true. Now all we need to know is the value of A or B. If A is true, then X is true, no matter what the value of 6. And if B is true, then again X is true, no matter what the value of A.

On the other hand, it's easy to imagine a situation where both A and B have to be true in order for X to be true. In this case, if we know that A or B is true, then we don't know enough: everything depends on whether the other statement, B or A, is true also. But if we know that A or B is false, then we can stop right there: X must be false also, since both A and B must be true if X is to be true.

Now let's substitute the values 1 and 0 for the rather abstract concepts of "true" and "false." If we have three variables, called A, B and X, then we can rephrase our rules as follows. In the first case we can say:

If A = 1 OR B = 1 then X = 1

And in the second case we can say:

If A = 1 AND B = 1 then X = 1

Now let's make up a little table showing all the possible values of the three variables for the two relationships:

A	OR	B	X
1		0	1
0		1	1
1		1	1
0		0	0
A	AND	B	X
1		0	0
0		1	0
1		1	1
0		0	0

This is not as trivial as it looks. Semiconductor "logic" is implemented by complex combinations of transistors that "gate" electrical current through various paths according to these rules (and others). There are also 68000 instructions that operate on registers and memory locations according to these rules, and that's what we're interested in here.

Think of A, B and X not as variables but rather as bits in a binary representation of data. The 68000 instruction andi.b #$01,d4 will take the immediate byte value 1 hex (i.e., a byte with only its rightmost bit set) and match it with the low byte in register d4. Now the bits in the two values will be compared according to the logical relationship specified (in this case, AND), and the result will be put in the destination operand, register d4. In other words, bit 7 of the source operand will be AND-ed with bit 7 of d4, bit 6 with bit 6, and so on. We don't know what's in d4, so we can't say definitely what the result of executing this instruction would be, but we can say that d4 will have to end up with either the value 1 or 0, since the zeroed upper bits of the source operand prevent any of those bits from being 1 in the result.

Changing conditions

We need to know about the 68000's logical operations here because these are the only instructions that allow us to write to the Condition Codes register. We want to make sure that the Extend bit is 0 before we execute our first subx, but we don't want to affect any of the other bits. According to the rules above, if we AND a 1 with a 0, the result will be 0; and if we AND a 1 with a 1, the result will be 1. Also, if we AND a 0 with anything, the result will always be 0.

So we want to AND an immediate value with the Condition Codes register that will contain 1s to be matched against all but the Extend bit. The Motorola manual tells us that the Extend bit is bit 4, so we want a binary value that looks like this (the Condition Codes register is only 8 bits long): 11101111. This is the same as hex EF. Actually, we don't have to worry about bits 5, 6 and 7: they aren't used. And since bit 4 is 0, we can shorten the number to binary 1111, or hex F. The instruction would then be written as andi #$0f, ccr. There's no size specifier, since operations on the Condition Codes register are automatically byte size. This instruction, when executed, will clear the Extend bit and leave the others alone.

Dividing 64 bits

Aside from these new instructions, div64 is fairly straightforward. Since we're dealing with double-longword values, we have to work with register pairs for the operands and for the "subtract counter," which will contain our quotient. The routine begins by making sure that the divisor isn't zero. If it is, then it loads register d4 with an error code and returns to its caller. Note that none of the "return registers" (d0 through d3) are used for the error code. Now that we're dealing with all possible 32-bit values as valid return values, we have to reserve a separate register for any return codes.

If all is well, then the main loop is entered. The numbers must be compared to see if the dividend is still larger than the divisor—when it isn't, or when it becomes zero, then we're finished. The subtract itself is accomplished with two subx instructions—one for each longword. Each complete subtract is counted in d5; when there's a carry out of d5, it's added explicitly into d4. Since we have a pair of data registers (d6 and d7) that aren't being otherwise used, we could have used a pair of addx (add with extend) instructions instead by loading, say, d6 with 0 and d7 with 1 at the beginning. Then we could do the following:

addx.l d7,d5 count subtract

First the 1 in d7 is added to d5. Then the 0 in d6 is added to d4. If there was a carry out of the first add, then the Extend bit was set, and it will be included in the second add. There's no need to clear the Extend bit before this operation: if our comparisons at the top of the loop are correct, then there's no possibility that the second subx could have generated a borrow. Instead of using addx here, however, I thought I'd show how the operation is done explicitly.

Finally, the remainder and quotient are transferred to the correct return registers, a "success" code is loaded into d4, and div64 returns to its caller.

ASCII you shall receive

Of course, the divisors table had to be expanded to allow a 64-bit conversion, and cv64, the conversion-to-ASCII routine, is changed also as a result. Most of the differences between it and cv32 from last time involve the data registers used for certain operations, and the new double-longword table.

The algorithm used here for ASCII conversion is simple. The number is divided by decreasing powers of 10 (beginning with a number equal to a 1 followed by 19 zeros—I'm not sure what it's called!). The quotient from each of these operations is a number from 0 to 9, which is then converted to ASCII by adding it to the ASCII value for zero; the remainder is then divided by the next-lower power of 10 to get the next digit, going from left to right, and so on, until 10 to the power of 0 (i.e., 1) has been used as a divisor, which yields the rightmost digit. The powers of 10 used by cv64 come from the divisors table.

The routine begins by writing spaces to the string area in dgts. Then it checks to see if leading zeros are to be written into the result. Register d6 gets the result of this test. You can think of d6 as a flag that tells whether the converted number is, so far, a zero (TRUE) or non-zero (FALSE). In the former case, spaces are written instead of leading zeros; otherwise, zeros are written whether leading or not. If leading zeros are not desired, then d6 is immediately set to TRUE: the number is considered to be a zero until a nonzero digit occurs. Otherwise, d6 is set to FALSE, the number is considered to be non-zero from the start, and consequently all zeros are written. At the end of the loop, if d6 is TRUE, then the number was a zero and only spaces have been written so far; so a single zero is written in the last byte of the string.

The digits are converted leftmost first, and the converted bytes are written to the string with the Address Register Indirect with Index mode. Register a0 points to the end of the string. Register d5 is loaded with the length of the string; this value is then negated with the neg instruction. When the negative value in d5 is added to a0, the result is the address of the first byte in the string; subsequent additions to d5 point further into the string; when d5 contains 0, the end of the string has been reached.

Back to numbers

The routine cvnbr handles the ASCII-to-number conversion. It has been changed slightly since last time in order to work with the new version of the factors table and also to use d1 to return a possible error code. The numbers to be multiplied can only be up to 32 bits long, so our task here is a bit easier. The routine begins by doing a crude check on the size of the string it's received. Register d1 is loaded with the maximum size allowed plus 1, and is then used as a counter as the string is indexed through until (or if) a null is found. The dbeq instruction will get us out of the loop as soon as a null is found or when d1's value becomes -1, whichever happens first. If it's the latter, then the string was too long.

(If you remember our discussion of the dbcc family of instructions several episodes back, then you know that they decrement their counter to -1, not to 0, before terminating. We loaded d1 with a value 1 greater than required, which would seem to set up for two extra iterations of the loop. However, we branch immediately to the dbeq the first time around, which uses up one of the extra iterations; and the null terminator, which is counted, uses up the other.)

Although this test will catch any input number longer than 10 digits, there are some illegal numbers it won't catch: any 10-digit number greater than 4,294, 967,295 to be exact. Numbers larger than this are too big to be held in a 32-bit register; multiplying them with this program will give a result of zero. There are various ways these values could have been checked for, but I didn't want the routines to get too long.

After this, it's just a matter of reading each digit from the string (starting with the leftmost), multiplying it by the next factor from the factors table, and adding the product to a cumulative total. We know that our product here will be much smaller than even 32 bits, so the high longword is simply discarded. The table of factors is simply the divisors table reading in the opposite direction; register a1 is decremented to index through it. Since divisors has double-longword values but only longwords are needed for factors, a1 is decremented an extra time to skip each extra longword. Converting an ASCII digit to a number is only a matter of subtracting the ASCII value of zero from it.

Carried away

Only one instruction has changed in mul32, but it's an important one. The instruction just before the label m__nxt0 adds in a carry from the previous addition to the product. Up until now I'd never used the high longword returned by mul32; I converted some random values by hand, and they seemed to be correct. But as soon as I was able to look at a lot of 64-bit conversions, it became clear that there was a subtle error in the way I was handling the carry.

Remember that a carry out of a data unit means that the bit can't be represented in that area: the number is now too big for it. As mul32 points back through the product area to add in successive partial products, it does point to the new memory area, which is where the carry should go; however, the indexing is done by words (i.e., two bytes backward at a time), whereas the partial products are added in as longwords. In other words, each new partial product "overlaps" with the old by a word. This is the way it should be—that's the way the numbers are added. However, it means that the carry shouldn't be added directly into the partial product as a longword—that would be the same thing as trying to stuff it back into the long-word it just popped out of. Instead, it should be added into the first entirely free word (or byte, to be precise) in the product area.

With a0 and d0 pointing to the new partial product area, we can access the first non-overlapping word by writing to the address as a word rather than as a long-word. If we tried this with a data register, it wouldn't work; we would be writing to the low word of the register. But in memory it's different: writing to a word at a given address will access the two bytes at that address and the next higher one; writing to the longword will access those two bytes and the next two as well, which are just the ones we don't want to touch. The new instruction addq.w #1, 0(a0,d0.I) adds in the carry correctly.

Why jsr?

I promised last time to explain why you can jsr but not bsr to a subroutine indirectly. The bsr instruction calculates the new address to be loaded into the Program Counter (PC) by using a number which is an offset from the current value of the PC. In other words, when a program is assembled, the assembler calculates for every bsr instruction the distance from it (or rather the next instruction after it) to the label in the instruction's destination field. The assembler can do this because both the value of the label and the location of the bsr are known to it, and the distance to (the "relative address" of) the destination is the difference between the two.

Jumping to an indirect address is different, however. The assembler knows only that the address will be in a certain register; it can't possibly know what that address will be, so it can't calculate the distance between the jump and the address. So this has to be an absolute jump; the jsr instruction simply loads its destination address into the PC, and—jumps.

But why do we need relative addressing at all? The answer is that it allows "position-independent" code—programs that can be loaded and run anywhere in memory, since what seem to the programmer to be absolute addresses—e.g., labels of subroutines—are actually represented as relative addresses in the program.

The rest of the program

The beginning of the program should be self-explanatory. GEMDOS function 10 (hex 0A), READLINE, is used to read the number strings from the keyboard. It's an easy way to input an edited string. To use it, you just push the address of a string area (more about this in a moment), then the function code, and call GEMDOS. The first byte of your string should contain a number (less than or equal to 255, of course) that tells GEMDOS the maximum number of characters to input. When the function returns, the second byte of the area will contain the number of characters actually input. If you add this number to the address of the third byte in the area (which is where the input characters begin), you will get the address of the carriage return (hex 0d) that ended the input. Note that if you want a null-terminated string, this is where you should insert the null; READLINE does not do this.

Typing in a number larger than 10 digits will end the program with an error message; otherwise, you can end it by pressing Control-C.

Now that we can divide 64-bit numbers, it seems only natural to multiply them. Of course, to check our result, we'd have to have a 128-bit divide routine. But it seems silly to stop at 128 bits—I wonder if IBM started this way?

Program listings begin on page 78.