Inside the ST Xformer
A SPECIAL INCLUSION
Part 2—the hardware.
by Darek Mihocka
Darek Mihocka is a second-year co-op Computer Engineering student at the University of Waterloo (near Toronto). Between school terms, he works at Microsoft Corporation in Seattle. He is a licensed pilot and also enjoys heavy metal music.
Last time, we presented Part 1 of Inside the ST Xformer. This, the conclusion, continues where the first installment left off. The program files described in this article were offered on the disk version, which you may order as a back issue — or you can find the listings in the ANALOG Computing Atari SIG, on Delphi.
Simulating hardware involves figuring out which hardware register is being accessed, and what to do about it. For example, if a memory location that appears on-screen is being written to, the screen display must be updated. All code for hardware simulation is in the file __XATARI.C.
This introduces a problem: how to trap the emulator when one of these special memory locations is being accessed, without introducing too much overhead. This is where the second 64K block of memory (pointed to by stat and REGSTAT) is used.
Each byte of the block pointed to by stat corresponds to a byte of the 64K main memory. The stat bytes are exactly that: status bytes which indicate the status of a byte in main memory. If the status of a memory location is a 0, it can be read or written freely. If its status is nonzero, the access must be trapped and handled.
You may notice in the code that elements of mem are accessed through pointers, while elements of stat are accessed through an index. The reason I use an index in the second case is so that I can use the 16-bit offset mode of the 68000 to quickly check the status byte.
Some memory requires special handling for both read and write, and some memory only requires handling for write operations. For example, screen memory can be read freely without the need of any handling, while writing to the screen must be trapped so that the screen may be updated. To classify memory into one of these two categories, I use the high bit of the status byte to indicate the type of handling required. Here's a brief summary:
Status byte:
$00 .......... Memory is regular RAM; no special handling.
$01-$7F ...... Memory can be read freely, but any write access must be trapped and handled.
$80-$FF ...... Both read and write operations must be trapped.
Note that there are no memory locations that can be written to freely but can't be read freely. Therefore, those types of status bytes do not exist.
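The classification above can be sketched in C as follows. This is only an illustration, not the emulator's code: mem_tab and stat_tab stand in for the article's mem and stat blocks, and the two helper functions are hypothetical names.

```c
#include <stdint.h>

/* Stand-ins for the article's two 64K blocks (mem and stat). */
uint8_t mem_tab[0x10000];   /* emulated 6502 memory               */
uint8_t stat_tab[0x10000];  /* one status byte per byte of memory */

/* Hypothetical helper: nonzero when a read of addr must be trapped.
   Only status bytes $80-$FF (high bit set) trap reads. */
int read_needs_trap(uint16_t addr)
{
    return (stat_tab[addr] & 0x80) != 0;
}

/* Hypothetical helper: nonzero when a write to addr must be trapped.
   Any nonzero status byte ($01-$FF) traps writes. */
int write_needs_trap(uint16_t addr)
{
    return stat_tab[addr] != 0;
}
```

Testing the high bit for reads and the whole byte for writes keeps each check down to a single comparison, which matters because the check runs on every emulated memory access.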
The status numbers correspond to which handler routine must be used for that particular memory location. The array serv__hdwr is an array of 256 pointers, which point to up to 256 different handler routines.
For example, ROM locations have a status byte of 2. The array element serv__hdwr[2] contains a pointer to a routine which does nothing. In other words, an attempted write to a ROM location will result in no write at all. This is the simplest example out of the few dozen that are actually implemented.
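A minimal sketch of the status byte acting as an index into a table of handlers might look like this. The handlers array stands in for serv__hdwr; the function names and the handler signature are assumptions made for the example, and the real routines are written in 68000 assembler.

```c
#include <stdint.h>

typedef void (*handler_fn)(uint16_t addr, uint8_t value);

uint8_t    mem_tab[0x10000];   /* stand-in for mem        */
uint8_t    stat_tab[0x10000];  /* stand-in for stat       */
handler_fn handlers[256];      /* stand-in for serv__hdwr */

/* Handler for status 2 (ROM): the write is simply dropped. */
void h_rom(uint16_t addr, uint8_t value)
{
    (void)addr;
    (void)value;
}

/* Hypothetical store routine: plain RAM is written directly,
   anything else goes through the handler chosen by its status byte. */
void store_byte(uint16_t addr, uint8_t value)
{
    uint8_t s = stat_tab[addr];

    if (s == 0)
        mem_tab[addr] = value;
    else if (handlers[s])
        handlers[s](addr, value);
}

/* Mark the range first..last (inclusive) as ROM. */
void init_rom(uint16_t first, uint16_t last)
{
    uint32_t a;

    handlers[2] = h_rom;
    for (a = first; a <= last; a++)
        stat_tab[a] = 2;
}
```

After init_rom(0xE000, 0xFFFF), a call such as store_byte(0xE000, 0) leaves the ROM image untouched, while stores to ordinary RAM go straight through.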
Note that read and write operations are handled quite differently. Write operations return straight to the main loop through DISPATCH, while read operations have to return to the opcode routine to complete the read. Thus, one method exits through a JMP, one through an RTS. The variable isread must be used to keep track of this, to prevent major catastrophic stack overflows! I cannot emphasize enough the importance of setting isread.
Most hardware locations are fairly easy to emulate. For example, location 53770 will return a random number when read. The handler code at s__rnd is quite simple. The value in isread is checked by the macro TESTWRITE. If it is 0 (a write operation), a branch is made to nul, which does nothing and dispatches. If it is a read, a new random number is generated, stuffed into memory location mem+53770, and an RTS returns back to the calling routine, usually doLDA, doLDY, or doLDX.
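The read/write split in such a handler can be sketched in C as below. The is_read parameter mirrors the isread flag. The article does not say how the random value is produced, so the 16-bit shift register used here is purely an assumption, as are the names rnd_handler and next_random.

```c
#include <stdint.h>

#define RANDOM 53770u        /* the location named in the text */

uint8_t  mem_tab[0x10000];   /* stand-in for mem               */
uint16_t lfsr = 0xACE1u;     /* generator state, any nonzero seed */

/* Assumed generator: one step of a 16-bit Galois shift register. */
uint8_t next_random(void)
{
    unsigned lsb = lfsr & 1u;

    lfsr >>= 1;
    if (lsb)
        lfsr ^= 0xB400u;
    return (uint8_t)lfsr;
}

/* Hypothetical handler: a write is ignored; a read refreshes the
   byte in emulated memory and returns it to the caller. */
uint8_t rnd_handler(int is_read, uint8_t value)
{
    (void)value;
    if (!is_read)
        return 0;                      /* write: do nothing */
    mem_tab[RANDOM] = next_random();
    return mem_tab[RANDOM];
}
```

Storing the new value in memory before returning means the opcode routine that triggered the trap can finish its ordinary load from mem without knowing a handler ran.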
Note that the macros LOADREGS and SAVEREGS must be used anytime we go from assembler back into C. C uses registers D0-D7 and A0-A3 for its own storage, and so wipes out the 6502 variables.
Unfortunately, not all locations are this simple to emulate. The majority of the code in __XATARI.C is dedicated to screen handling. The Atari 800 has about seventeen different display modes and, through display lists, it can display them all at the same time, and scroll each one independently. This is simple on the 800, since the ANTIC and GTIA chips do all the work, at the same time that the 6502 is doing its stuff. It's a nightmare on the ST, since it must all be handled by the 68000, and as quickly as possible.
It became necessary to decide, for speed purposes, which features could be emulated in reasonable time and which couldn't. I decided that player-missile graphics, display list interrupts, and fine scrolling could not be supported efficiently on the emulator, given its already slow speed. What remains of my attempt at player-missile graphics is found at the end of __XATARI.C.
Once these features were axed, it was possible to write some very fast graphics routines, which work well with most 8-bit software. Of course, heavily graphics oriented demos still run much more slowly, but almost any display list combination possible is supported.
The graphics routines are called plot__2 through plot__F, and each one simulates one of the fourteen displayable ANTIC modes. Also, plot__F can simulate the three GTIA modes, for a total of seventeen modes.
Several problems still exist: all displayed bytes must have their status bytes modified, and, when a write to such a location is trapped, the memory address must quickly be converted into an X- and Y-location on-screen and an ANTIC mode number, so that the appropriate plot__x routine can be called. Impossible, you say?
The solution to this problem is a data structure I call DL, which is defined at the beginning of __XATARI.C, and an array of such structures called dlBlocks. Each entry in dlBlocks is similar to an entry in the actual display list, except that scrolling and interrupt bits are ignored, and consecutive identical bytes in the real display list are merged into one DL. The structure DL contains information such as the ANTIC mode for that portion of the display, the height in scan lines, the number of bytes per line, the number of consecutive bytes displayed, the starting scan line, and the location of the first byte to display.
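One possible layout for such a structure, with the merging of consecutive identical mode lines, is sketched below. The field names, the DLBlock type, the table size and dl_add_line() are all assumptions for illustration; the actual DL definition is in the source listing.

```c
#include <stdint.h>

/* Assumed layout, following the fields listed in the text. */
typedef struct {
    uint8_t  mode;            /* ANTIC mode for this part of the display */
    uint8_t  height;          /* height of one mode line, in scan lines  */
    uint8_t  bytes_per_line;  /* bytes displayed per mode line           */
    uint16_t start_line;      /* first scan line of the block            */
    uint16_t first_addr;      /* 6502 address of the first byte          */
    uint16_t byte_count;      /* consecutive bytes displayed             */
} DLBlock;

#define MAX_BLOCKS 64         /* arbitrary limit for the sketch */

DLBlock blocks[MAX_BLOCKS];   /* stand-in for dlBlocks */
int     n_blocks = 0;

/* Add one mode line. If it has the same mode as the previous block and
   its data follows on directly in memory, extend that block instead of
   creating a new one. Returns the block index, or -1 if the table is full. */
int dl_add_line(uint8_t mode, uint8_t height, uint8_t bpl,
                uint16_t line, uint16_t addr)
{
    if (n_blocks > 0) {
        DLBlock *p = &blocks[n_blocks - 1];

        if (p->mode == mode &&
            (uint16_t)(p->first_addr + p->byte_count) == addr) {
            p->byte_count += bpl;
            return n_blocks - 1;
        }
    }
    if (n_blocks == MAX_BLOCKS)
        return -1;
    blocks[n_blocks] = (DLBlock){ mode, height, bpl, line, addr, bpl };
    return n_blocks++;
}
```

Feeding in the 24 identical text lines of a 40-column screen this way produces a single block of 960 bytes rather than 24 separate entries.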
The whole process is triggered by a write to memory location 559 ($22F). Remember the familiar POKE 559,0? Every time a GRAPHICS command is executed—or when the computer boots up—the operating system writes to that location to turn the screen off, and then on when the new display list is generated. This is trapped by the emulator, and sets off a long chain of events.
First, the flag dma is set to indicate that graphics are being displayed. Then a call is made to do__display(). This complex routine then traverses the display list and generates the DL structures. At the same time, the bytes making up the display list have their status bytes set to 64. Also, all displayed bytes are marked with a 65, and the plot__x routines are called. This loops until the end of the display list is reached, or the ST's screen is full. Remember, the ST can't display more than 200 scan lines. The 8-bit could display up to about 230, so some screens may get chopped off.
Once the screen is redrawn, any write to screen memory will get trapped because of the value 65 in the status byte. The routine do__byte() is called, which then quickly goes down the DL array until it finds a DL that corresponds to the screen byte being displayed. Note that this offers a significant speed increase over simply going down the display list, since a graphics 8 screen might have a 200-byte display list, but only a two-entry DL list. From the other information in the DL structure, we can then easily figure out the screen X- and Y-coordinates.
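The lookup described here might be sketched as follows. The DLBlock layout and locate_byte() are hypothetical; the sketch only shows how a trapped address can be turned into a mode, a byte column and a scan line by walking a short block table.

```c
#include <stdint.h>

/* Assumed layout of one merged display-list block. */
typedef struct {
    uint8_t  mode;            /* ANTIC mode number                   */
    uint8_t  height;          /* scan lines per mode line            */
    uint8_t  bytes_per_line;  /* bytes displayed per mode line       */
    uint16_t start_line;      /* first scan line of the block        */
    uint16_t first_addr;      /* 6502 address of the first byte      */
    uint16_t byte_count;      /* consecutive bytes displayed         */
} DLBlock;

/* Find the block holding addr. On success, returns 1 and fills in the
   ANTIC mode, the byte column within the line, and the scan line of the
   top of that mode line. Returns 0 if addr is not on-screen. */
int locate_byte(const DLBlock *tab, int n, uint16_t addr,
                int *mode, int *col, int *line)
{
    int i;

    for (i = 0; i < n; i++) {
        const DLBlock *b = &tab[i];
        long off = (long)addr - (long)b->first_addr;

        if (off >= 0 && off < (long)b->byte_count) {
            *mode = b->mode;
            *col  = (int)(off % b->bytes_per_line);
            *line = b->start_line + (int)(off / b->bytes_per_line) * b->height;
            return 1;
        }
    }
    return 0;
}
```

The returned mode would then select the matching plot__x routine, and the column and line give it the position to redraw.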
This almost completes the description of the graphics handling. When the display list is modified, or a POKE 559,0 is executed, the routine clear__stat() quickly goes through the DLs and clears the status bytes of all screen locations, so that a new call to do__display() can be made.
As mentioned in the previous section, player-missile graphics are not supported in this version of the emulator. Although it is not too difficult to simulate some of the memory locations required for PMG, it became obvious that the problem would be not with drawing sprites, but, instead, with erasing them. For example, when a sprite is moved on-screen, it requires a write to one memory location (on an 800). On the ST, which doesn't have real sprites, this requires undrawing the sprite by replacing the graphics that were below it, then redrawing it. Since a single sprite may occupy as much as one-quarter of the entire screen, this means that thousands of the ST's screen locations would have to be written to. I have included some code in __XATARI.C which can be hacked on to make it draw sprites, but no code is included to erase them.
The solution will probably be the Blitter chip.
This same problem also exists with display lists, but since display lists are much less likely to change drastically, it can be tolerated. However, any assistance the Blitter can provide will probably help a lot.
Joystick ports can't be fully implemented on the ST, due to its lack of support for paddles. Also, support for only two joysticks can be provided, and, since the ST's ports are not capable of output, plug-in peripherals cannot be used on the ST. But this is still adequate for most software, which simply reads joysticks.
To read the joysticks on the ST, one cannot simply PEEK a memory location, as on the 800. Instead, an interrupt is generated every time a joystick event occurs, whether it be the pressing of a button or a stick movement.
The interrupt routine is installed into a table of vectors known as kbdvecs. This table has nine vectors which point to handlers for such events as keyboard, joystick and MIDI input. The seventh vector is the pointer to the joystick handler. At entry into the joystick handler, A0 points to a 3-byte "packet." My routine Stick then reads two of the bytes which give the current status of the two joysticks.
The first byte, telling us which joystick generated the interrupt, is ignored, since it is faster to just read both bytes than to do extra processing to determine which byte should be read. The interrupt is switched on and off with the routines JoyOn and JoyOff, found in __XATARI.C.
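A C sketch of such a packet handler follows. It assumes the usual ST keyboard-processor format for a joystick byte (bit 7 for the button, bits 0 through 3 for up, down, left and right, with 1 meaning active) and converts it to the 8-bit convention, where a 0 bit means active and 15 means centered. The names joy_packet, stick and strig are hypothetical; the real Stick routine is in assembler.

```c
#include <stdint.h>

uint8_t stick[2] = { 15, 15 };  /* 8-bit style stick values, 15 = centered  */
uint8_t strig[2] = { 1, 1 };    /* 8-bit style triggers, 0 = button pressed */

/* pkt points to the 3-byte packet: pkt[0] is the header (ignored),
   pkt[1] and pkt[2] hold the state of joysticks 0 and 1. */
void joy_packet(const uint8_t *pkt)
{
    int j;

    for (j = 0; j < 2; j++) {
        uint8_t b = pkt[1 + j];

        stick[j] = (uint8_t)(~b & 0x0F);   /* invert the direction bits */
        strig[j] = (b & 0x80) ? 0 : 1;     /* invert the button bit     */
    }
}
```

Both state bytes are converted on every event, which matches the choice described above of reading both rather than decoding the header.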
Although the ST handles the keyboard in the same manner as the joysticks, it was almost unnecessary to write a keyboard handler, since I can easily call Bconstat() and Bconin() to get keys from the keyboard. The problem is that all the nonshift keys auto-repeat on the ST keyboard, and there's no way to tell if a key is still being pressed. We can only find out when it gets pressed. This presents a problem when trying to emulate the START, SELECT and OPTION keys, since they clearly do not and must not auto-repeat. Also, we have to know, at any point in time, whether any of those keys is still pressed down.
The keyboard handler has three routines: Install__Key(), Remove__Key(), and KeyPatch. Install__Key() installs KeyPatch as the ninth vector in the kbdvecs table. Remove__Key() un-installs it.
KeyPatch loads A0 with $FFFFFC00, which is the address of the hardware register where the keycode appears. It then reads the keycode and compares it against a list of keycodes for the keys F7, F8 and F9. Fortunately, an interrupt is generated both when a key is pressed and when a key is released, making it possible to monitor the state of the keys.
At address $FFFFFC00 is a device known as an ACIA, which handles all joystick, mouse and keyboard events. It generates a code from $00 to $FF. What the ROM keyboard routine does is check the keycode; if this is $FE or $FF, it calls the joystick handler. If the code is $F6 to $FD, it calls the mouse handler, and anything else is treated as a keycode. Note that this limits the maximum number of keys on the keyboard to $76 or 118. It also means that the joystick and keyboard routines could probably be merged into one routine. Any takers?
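The code ranges above, together with the press and release tracking for F7, F8 and F9, can be sketched in C like this. It assumes the standard ST scan codes for those keys ($41, $42, $43) with bit 7 set on release. Which function key stands for START, SELECT or OPTION is an assumption here, as are the names classify_code, key_event and consol.

```c
#include <stdint.h>

enum { EV_KEY, EV_MOUSE, EV_JOYSTICK };

/* Sort a byte from the ACIA into the three ranges given in the text. */
int classify_code(uint8_t code)
{
    if (code >= 0xFE)
        return EV_JOYSTICK;   /* $FE-$FF */
    if (code >= 0xF6)
        return EV_MOUSE;      /* $F6-$FD */
    return EV_KEY;            /* everything else */
}

/* 8-bit style console value: bit 0 START, bit 1 SELECT, bit 2 OPTION,
   where a 0 bit means the key is held down. 7 = nothing pressed. */
uint8_t consol = 7;

/* Track F7/F8/F9 from press and release codes (mapping assumed). */
void key_event(uint8_t code)
{
    uint8_t bit;

    if (classify_code(code) != EV_KEY)
        return;
    switch (code & 0x7F) {           /* scan code without release bit */
    case 0x41: bit = 1; break;       /* F7 -> START  (assumed) */
    case 0x42: bit = 2; break;       /* F8 -> SELECT (assumed) */
    case 0x43: bit = 4; break;       /* F9 -> OPTION (assumed) */
    default:   return;
    }
    if (code & 0x80)
        consol |= bit;               /* key released */
    else
        consol &= (uint8_t)~bit;     /* key pressed  */
}
```

Because state changes only on a press or a release code, auto-repeat cannot disturb it, and the current state of all three keys is available at any time.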
Vertical blank interrupts were a bit tricky to implement. The problem was that every sixtieth of a second, the emulator had to somehow drop whatever opcode it was about to simulate and jump into a vertical blank routine, followed by a deferred vertical blank routine, and then go back to the original opcode.
The key lies in the DISPATCH macro. Note that, since pemul is already used to divert the dispatcher, it can be made to divert it straight into a vertical blank routine.
There are actually two vertical blank routines. The first, called VBI, does a few things the Atari 800's system VBI does, like incrementing the real-time clock and checking joysticks. This way, the real-time clock (locations 18, 19, 20) retains its accuracy, which keeps programs that depend on it up to speed. VBI then also changes the pemul vector to point, not to emul but, instead, to a routine called sysvbl. Then, when the current opcode being simulated finishes, the DISPATCH macro jumps to sysvbl, the second VBI routine. In that routine, we do things like update the color registers and check the keyboard.
Finally, we simulate a 6502 interrupt by pushing the A, PC, and P registers to the 6502 stack. Then we make an indirect jump through the deferred VBI vector ($224) to a routine which must end in an RTI.
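The diversion through pemul can be sketched in C with a function pointer. This is a simplified model rather than the emulator's code: emul() here merely counts opcodes, vbi_tick() stands in for the first VBI routine, and the stack push follows the 6502's own interrupt order (PC high, PC low, P). Saving the A register, which the article also mentions, is left out.

```c
#include <stdint.h>

uint8_t  mem_tab[0x10000];   /* stand-in for mem              */
uint16_t pc;                 /* 6502 program counter          */
uint8_t  sp = 0xFF;          /* 6502 stack pointer            */
uint8_t  p_reg;              /* 6502 status register          */
int      opcodes_run = 0;    /* counter used only by the sketch */

void emul(void);
void sysvbl(void);

void (*pemul)(void) = emul;  /* the dispatcher jumps through this */

/* Push one byte on the 6502 stack in page 1. */
void push(uint8_t v)
{
    mem_tab[0x0100 + sp] = v;
    sp--;
}

/* Normal path: stand-in for fetching and running one opcode. */
void emul(void)
{
    opcodes_run++;
    pc++;
}

/* Called from the ST's vertical blank interrupt: divert the next dispatch. */
void vbi_tick(void)
{
    pemul = sysvbl;
}

/* Runs once between two opcodes: restore normal dispatch, push the
   return state, and jump through the deferred VBI vector at $224. */
void sysvbl(void)
{
    pemul = emul;
    push((uint8_t)(pc >> 8));
    push((uint8_t)(pc & 0xFF));
    push(p_reg);
    pc = (uint16_t)(mem_tab[0x0224] | (mem_tab[0x0225] << 8));
}

/* Stand-in for the DISPATCH macro. */
void dispatch(void)
{
    pemul();
}
```

The interrupt itself only swaps a pointer, so it can fire at any moment without disturbing the opcode in progress; the real work waits until that opcode has finished.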
This all results in vertical blank routines that run at real time. So things that most games have, like background music, will play at normal speed, even though the game itself plays at 20 percent of the speed. Some games which are totally VBI based, like ANALOG Computing's Maze War (issue 36), will run in real time.
Operating system simulation (P: and D:).
One of the main problems with emulators is their slow speed. It's a lot easier to run the real thing than to try to translate the code on another machine, even a faster one. One way to increase speed is to take commonly used pieces of code and replace them with simulation routines. For example, the CIO call in the Atari 800XL operating system could be trapped, and a C language routine executed instead. This would actually increase the speed of the emulator so that, if the operating system is called a lot, it could run faster than the real computer. In fact, by replacing the whole operating system (and BASIC) with simulation routines, the emulated version could run many times faster than the real computer. Imagine BASIC XL running at ten times the speed.
Back to reality. Operating system call trapping could be implemented in the same way as hardware trapping, using the same serv__hdwr array. I haven't done any such trapping in this version of the emulator, because this method is slower than another simpler method: using the one-hundred or so unused opcodes to call emulator routines.
The principle is very simple. Suppose that one wanted to rewrite the output routine for the E: device. That's the well-known $F6A4 entry point. By putting an invalid opcode at that location, say opcode $FF, and making the appropriate entry into the vec__6502 array, anytime the program counter reaches $F6A4, it loads the opcode $FF and jumps to our new routine, instead of executing the original code in ROM.
This is exactly how the P: device is emulated. In the routine InitMachine() is code which places these unused opcodes at the six entry points to the P: handlers. Five of these are patched with the opcode $7F, which simply stuffs a value of 1 in the Y register and returns to CIO. The P: putchar handler is patched with $6F. That routine (op6F) then calls Bconout to print through the ST's printer port.
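The patching idea can be sketched in C as follows. The vec array stands in for vec__6502; the entry address used in the usage note is arbitrary, the output buffer replaces the real Bconout call, and returning 1 in Y from the putchar routine is an assumption borrowed from the $7F behavior.

```c
#include <stdint.h>

typedef void (*op_fn)(void);

uint8_t  mem_tab[0x10000];   /* stand-in for mem        */
op_fn    vec[256];           /* stand-in for vec__6502  */
uint16_t pc;                 /* 6502 program counter    */
uint8_t  sp = 0xFF;          /* 6502 stack pointer      */
uint8_t  a_reg, y_reg;       /* 6502 A and Y registers  */
char     printed[64];        /* captured output for the sketch */
int      n_printed = 0;

/* Pull one byte from the 6502 stack in page 1. */
uint8_t pull(void)
{
    sp++;
    return mem_tab[0x0100 + sp];
}

/* 6502 RTS: pull the return address and continue one byte past it. */
void do_rts(void)
{
    uint16_t lo = pull();
    uint16_t hi = pull();

    pc = (uint16_t)(((hi << 8) | lo) + 1);
}

/* Hypothetical stand-in for op6F: "print" the character in A. */
void op_putchar(void)
{
    if (n_printed < (int)sizeof printed)
        printed[n_printed++] = (char)a_reg;   /* real code calls Bconout */
    y_reg = 1;                                /* success status (assumed) */
    do_rts();
}

/* Plant an unused opcode at a ROM entry point and register its routine. */
void patch(uint16_t entry, uint8_t opcode, op_fn fn)
{
    mem_tab[entry] = opcode;
    vec[opcode] = fn;
}

/* Fetch the opcode at PC and run its routine, if one is registered. */
void step(void)
{
    uint8_t op = mem_tab[pc];

    if (vec[op])
        vec[op]();
}
```

With patch(0xF000, 0x6F, op_putchar) in place, a 6502 JSR to $F000 never executes any ROM code: the dispatcher fetches $6F, runs the C routine, and the simulated RTS returns control to the caller.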
In a similar—but more complicated—manner, the D: device is emulated. The opcodes $0F, $1F, $2F and $3F are used to divert CIO to the routines that simulate OPEN, CLOSE, PUT and GET. Each of the routines makes the appropriate calls to GEMDOS and exits with the 6502 memory set as if a real DOS routine had just executed.
The patches for the D: handlers are actually made to the C: device handlers, and then C: is renamed to D: in the 6502 ROM. Best of all, we get the benefits of DOS without having Atari DOS loaded into memory, so most programs will enjoy about 5K more space. Similarly, the devices E:, S: and K: can be patched to call our own routines. We could even install new drivers, like R: for a modem.
What can be done to speed things up.
As I already mentioned, the code could be expanded for some speed increases. Also, many of the C routines could be rewritten entirely in 68000 code, but that will result in huge source code.
As also mentioned, by emulating the entire operating system, all calls to any device would be much faster. For example, the screen editor could run at real time, and even faster. Plotting and line drawing would be lightning fast, since Line A could be called. The floating-point routines could be rewritten in 68000 to execute at ten times their normal speed. One could even go as far as to rewrite Atari BASIC in 68000. Imagine BASIC running at ten times its normal speed!
Of course, all these improvements will take time and will still not solve one problem: any program that doesn't call the operating system (and many binary files don't) will not speed up at all, since all of its code will still be interpreted. Also, some BASIC programs will probably be unusable at ten times their normal speed.
But what about programs running on the current emulator? If they're in BASIC, they most likely get their timing from FOR NEXT loops. All that has to be done in most cases is to trim the loops by about a factor of 5. If the program gets its timing from the real-time clock, then there's no problem. For machine language routines, a similar reduction in loops can be done.
One could also simply use faster 6502 routines and emulate them. The file __FASTCHP.FPX contains the code for the Newell Industries Fastchip floating-point routines, which triple the speed of most floating-point operations. Simply rename it to __FASTCHP.FP and delete __ATARI.FP.
The BASIC XL runtime package works fine with the emulator, so it could be used in place of Atari BASIC to run BASIC files.
The file __NEWELL.OSX can be renamed to __NEWELL.OSB, and __ATARI.OSB deleted. The Newell operating system offers some enhanced and faster functionality, such as access to graphics modes 12 through 15.
Emulating other computers.
Anyone interested in modifying the ST Xformer to emulate other machines is free to do so. If it's to be a 6502-based machine, the file __X6502.C can be left untouched. Then __XFORMER.C has to be changed to simulate the DOS of the new machine, and __XATARI.C should be renamed and totally rewritten for that particular hardware.
If it isn't for a 6502-based machine, the file __X6502.C should be renamed and the opcode handlers rewritten.
A word of warning: Other manufacturers won't be too pleased about emulators on the ST running their computers' software. I got an unfriendly response from Apple regarding my Apple II emulator. The same can be expected from other companies, since they're interested in selling their machines, not STs. Of course, by emulating the entire operating system of each particular machine, you can get around that, but then you suffer from the problem of lower compatibility. Look at the case of PC clones which will not run some real PC software. Another way around the problem is to take the path of the Magic Sac, but then you no longer have a software-only emulator—and costs are much higher.
Most other computers should be much more easily emulated. For example, the code for the Apple emulator is about 50K shorter and runs at about 40 percent of the speed of an Apple. This is due to the Apple's slower clock speed, simpler graphics modes and total lack of any interrupts.
I hope that this explanation, along with a long printout of the program, will give you an insight into the process of emulating one computer on another. Some of you may even be encouraged to write your own versions of the emulator for other microprocessors, like the 6809 and Z80. If enough people work on this program, most of the major 8-bit machines will be emulated.
With the forthcoming 68020-based Atari TT, the speed of the emulator may increase five-fold, to the point where the emulated software will run at the same speed as the real thing. Thus, the Atari TT may become a completely universal machine, capable of running most available software on all machines.
Anyone having any further questions, code improvements, or a list of programs that work with the emulator, can contact me on Delphi (username DAREKM), CompuServe at 73657,2714 and on GEnie (also DAREKM). I would like to maintain one master copy of all improved versions, which would then be released periodically with an updated working program list.
If enough interest is generated, perhaps Delphi, CompuServe or GEnie might even set up a separate download section of software known to work with the emulator, as is currently done with Mac software for the Magic Sac.