Classic Computer Magazine Archive A.N.A.L.O.G. ISSUE 66 / NOVEMBER 1988 / PAGE 25

BOOT
CAMP


Light Torch...Gird Loins
...Boot Assembler


by Karl E. Wiegers


How many of you have ever played a computer adventure game of some sort? I see a lot of hands in the air. (Great eyes, no?) And if you've ever tried to write one yourself, you quickly discovered that even a simple adventure game involves some pretty sophisticated programming. The ever-adventurous Clayton Walnum once wrote an insightful three-part series on how to design and program your own adventure games.

    Blow the cobwebs off issues 39, 40 and 41 of ANALOG Computing, from early 1986, and re-read what Clayton had to say. It's okay, I'll wait here until you're done.
    Back already? Then you've learned many things (or else you were watching Dallas reruns while you were supposed to be studying). Clayton's articles told you (among other useful stuff) that the heart of an adventure game is its "parser." The parser is the program code that lets the computer interpret commands you type and take some appropriate action. It is the parser that gives the computer some appearance of being intelligent. Of course, computers aren't intelligent in the least; good parser programmers are.
    In reality, parsers are useful for much more than simply exploring dungeons. Many kinds of computer applications can benefit from a user interface that at least attempts to understand natural language communications. While it's pretty hard to get a computer to understand spoken instructions, the written word can be interpreted a bit more easily.
    In the next Boot Camp or two, we'll see how a very simple parser can be implemented on the 8-bit Atari, using some assembly language for the time-intensive parts. You really aren't likely to write a complete application program in assembler around this word-searcher nucleus. Hence, we'll set up a simple BASIC program structure that interfaces to the machine-language parser routine, to show you how it all fits together.
    Now, what subject area should we use to illustrate the fine art of parsing? Adventure games are kind of passe by now. Wait! I've got it! Imagine the kitchen of the future, automatically assembling ingredients in the quantities and sequence you specify, popping the result into the oven, and doing everything for you except eating the food. Let's write a general-purpose parser in assembly language, then cook up a BASIC program that might be used someday in Karl's Komputerized Kitchen.

In reality, parsers are useful
for much more than simply
exploring dungeons.


The joy of parsing

    There are three main aspects to a natural language-processing program or parser: 1) to take the input string apart into separate words and/or numbers, 2) to attempt to identify the individual words by looking them up in a vocabulary list, and 3) to try to understand what the input "means"; that is, see if the words identified in the input string constitute a recognizable instruction that can be executed by the program.
    Let's look at all three of these functions in more detail.

Dissecting the input

    The basic idea of natural language processing is that the user (who is presumably a human being of some sort) can present instructions or queries to the computer much as he would communicate with another human being. I'm sure you recognize how amazingly complex this kind of communication really is. After all, you use all sorts of shortcuts in your verbal and written communications, yet other people who know the same language can usually figure out what you're trying to say. We have to be pretty creative to do something similar with a microcomputer.
    Take a simple instruction of the sort you might give to a computerized kitchen when you want to bake a cake: "Slowly mix in two cups of brown sugar." Most of you should have a picture in your mind of what this means. But how do we get the computer to understand it?
    The first step is to break the input string into separate words. The simplest way to do this is to look for blank characters as delimiters between words. But what if the program user entered more than one blank between words, or used a punctuation mark (comma, semicolon, period, etc.) to separate words instead of (or in addition to) the blank? For simplicity, we'll decree that only single instructions can be entered. This means we don't have to look for complex sentences such as, "Melt the butter, then stir in the flour." So, most punctuation marks can be disregarded.
    However, we can't just ignore periods. What if you want to add 3.5 cups of flour? The period here is really a decimal point in a number. Obviously, when we split the input string into words, we must distinguish numbers from true words that we'll be trying to find in the vocabulary list.
    The moral of the story is that the natural language interface involves some "preprocessing" of the user's input. This step discards symbols like certain punctuation marks and builds a list of words for which we must search in the known vocabulary. Any numbers or other anticipated special character strings will be identified and set aside until we get to Step 3, in which we try to make some sense out of the input.
    The preprocessor can be a part of the parser code itself or it can be a separate routine. For simplicity in this example, I'll put the preprocessor into the BASIC program. If execution time is critical, you would want to recode it into assembler, but BASIC will be fine for our purposes.
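To make the preprocessing step concrete, here is a sketch of it in Python (illustrative only; the `preprocess` routine and all of its details are mine, not part of this month's listings, which do this job in BASIC). It splits an input string on blanks and punctuation, keeps a period that falls inside a number as a decimal point, and separates the numbers from the true words:

```python
# Sketch of the preprocessing step: break the input into words and
# numbers, discarding blanks and punctuation -- except a period that
# sits inside a number, which is kept as a decimal point.

def preprocess(line):
    tokens = []
    current = ""
    for ch in line:
        if ch.isalnum():
            current += ch
        elif ch == "." and current and current[0].isdigit():
            current += ch              # decimal point inside a number
        else:
            if current:                # blank or punctuation ends a token
                tokens.append(current)
            current = ""
    if current:
        tokens.append(current)
    # separate true words (uppercased for matching) from numbers
    words, numbers = [], []
    for t in tokens:
        if t[0].isdigit():
            numbers.append(float(t))
        else:
            words.append(t.upper())
    return words, numbers
```

Feeding it "Slowly mix in 3.5 cups of flour." yields the words SLOWLY, MIX, IN, CUPS, OF and FLOUR, plus the number 3.5; superfluous words like IN and OF simply won't be found when we get to the vocabulary lookup.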

Do I know you?

    Once your preprocessing step has come up with a list of words from the user's input string, we need to see if the words are "known" to the program. Your vocabulary list should be separate from the parser code itself, since a general-purpose parser routine could then be used for many applications having different vocabularies. The limited RAM in 8-bit microcomputers can really restrict the size of the vocabulary in a given program, because you still need some memory for the rest of the program.
    The simplest approach is to put all the words you want the program to recognize into the vocabulary list. Some alternative techniques provide more efficient use of memory, thus allowing larger vocabularies. Have you ever noticed that some spelling-checker programs appear to require far less memory than it seems they would need to handle, say, 30,000 words? Data compression methods and clever algorithms can be used to substantially reduce the amount of memory consumed by a block of information. However, we'll leave such techniques to the experts and stick to the brute-force approach.
    Another consideration is how much latitude you wish to give the user regarding different ways he can enter equivalent commands. For example, do you wish to distinguish between uppercase and lowercase letters? This can be important for proper names, like in a quiz on U.S. presidents. Do you want the user to be able to get away with a certain degree of misspelling? One way to handle this is to add anticipated misspellings of particular words to the vocabulary. A more sophisticated approach uses some algorithm to determine just how closely words must match vocabulary entries to be considered acceptable.
    In today's example, we'll translate all lowercase letters to uppercase, and the vocabulary entries will all be in uppercase. Only exact matches with vocabulary entries will be accepted.
    Yet another one of many characteristics of human, that is, interpersonal communication is that we tend to be more or less, I mean pretty much, wordy lots of the time, you know? Think back to "slowly mix in two cups of brown sugar." The words "in" and "of" certainly are superfluous to the meaning of this instruction. Prepositions and articles (a, an, the) can generally be ignored without disturbing the meaning of an instruction. Hence, we'll leave them out of the vocabulary. A word of warning: Be very careful with negation words like "not" and "don't." "Don't boil the milk" is rather different from "boil the milk"!
    How about words like "slowly" and "brown"? Adjectives and adverbs like these can be important, but not necessarily. The specific application dictates whether the actions to be taken depend on the presence of modifiers like these in the command string.
    Another important aspect of building a vocabulary is the handling of synonyms. Instructions to "add flour" and "add sugar" are obviously different. But are there differences between "add sugar," "mix sugar," "stir sugar," "beat sugar," "fold sugar" and so on? If not, these instructions are all equivalent. So, even though our parser has to locate the verb (add, mix, etc.) in the vocabulary, only one piece of program logic is required to handle all these inputs.

You have to consider how
much latitude you wish to
give the user regarding
different ways to enter
equivalent commands.



A token gesture

    Okay, so we've split the input string into words and found the words in the vocabulary list. Each word is assigned a numeric value, or "token." Each category of vocabulary entries will have a different set of tokens, and specific arrangements of tokens will make up valid instructions. Synonyms are given the same token.
    Listing 1 illustrates what I mean. This is an Atari BASIC program that creates the vocabulary file (VOCAB.DAT) for our computerized kitchen example. You can modify this program to create vocabulary files for other applications by changing the DATA statements in Lines 1000-1120. The DIM statements in Line 100 limit the length of a single vocabulary entry to 20 characters and the total length of the vocabulary file to 2,000 bytes, but of course you could change these restrictions.
    The first block of vocabulary words (Lines 1000-1030) pertains to ingredients that we think someone might want to use in describing a recipe. FLOUR is assigned the token 1, SUGAR is 2 and so on. Notice that I'm regarding BUTTER, MARGARINE and SHORTENING as equivalent ingredients, so they all are given the same token, a 5 (Line 1010). In the adventure game sense, these words correspond to the nouns that could be entered in a simple two-word command.
    Another section of our vocabulary concerns operations (verbs) the user might want to perform while cooking up something tasty. All of these have tokens in the range 20-29 (Lines 1040-1060). Again, some of them are considered to be synonymous (MIX, STIR, ADD, FOLD), and instructions containing any of these words will be handled in exactly the same way by the evaluator part of the parser.
    Vocabulary words with tokens in the 30-39 range are the units attached to meaningful numbers that could be part of a command. These units refer to cooking time (HOURS, MINUTES) or temperature (DEGREES). Words with tokens in the 40s are units pertaining to the quantities of ingredients that are to be added (CUPS, TSP and so forth).
    When we actually get to the part of the parser that determines whether a valid instruction was entered, the program will be looking at tokens, not at actual words. Certain patterns of tokens constitute valid commands. For example, suppose the instruction string entered said, "ADD 2 COCOA." The parser would tokenize this into ingredient = 8, amount = 2 (identified as a number), and operation = 21. However, the parser logic should recognize that something is missing: units. "ADD 2 what COCOA?" Cups? Ounces? Tablespoons? It makes a difference in the final product, or so I've heard. Hence, "ADD 2 COCOA" would be flagged as an invalid instruction, because no units were specified. More about the third portion of the parser next time.
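To illustrate the token scheme, here is a hypothetical sketch in Python. The dictionary hand-copies a few of the token assignments from Listing 1; the `check_add` routine and its rules are mine, a tiny stand-in for the evaluator we'll build next time:

```python
# Sketch of the token scheme: synonyms share a token, and each
# category of word lives in its own numeric range.

VOCAB = {
    "FLOUR": 1, "SUGAR": 2, "COCOA": 8,            # ingredients: 1-19
    "MIX": 21, "STIR": 21, "ADD": 21, "BAKE": 24,  # operations: 20-29
    "MINUTES": 32,                                 # time/temp units: 30-39
    "CUPS": 41, "TSP": 42, "OUNCES": 44,           # quantity units: 40-49
}

def check_add(words, numbers):
    """Is this a complete 'add ingredient' instruction? It needs an
    operation, exactly one amount, quantity units, and an ingredient."""
    tokens = [VOCAB.get(w, 0) for w in words]      # 0 = unknown word
    has_op = any(20 <= t <= 29 for t in tokens)
    has_units = any(40 <= t <= 49 for t in tokens)
    has_ingredient = any(1 <= t <= 19 for t in tokens)
    return has_op and has_units and has_ingredient and len(numbers) == 1
```

With this sketch, "ADD 2 COCOA" fails the check because no token from the 40s is present, while "ADD 2 CUPS COCOA" passes. Note that the check looks only at tokens, never at the original words.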

Vocabulary building

    Enough preliminaries; let's look at some more code! I said that Listing 1 is a utility program for creating a file containing a vocabulary list. The entire vocabulary list is treated as one giant character string variable, VOCAB$. For convenience of editing and for ease of reading the file, the contents of the string are written out to the VOCAB.DAT file in 40-byte records, in Lines 220-300.
    The DATA statements in Listing 1 contain the individual vocabulary entries and their corresponding tokens. A complete vocabulary entry in string VOCAB$ consists of: one character whose ATASCII value equals the number of characters in the word (Lines 130-140); the word itself (Line 150); and a character whose ATASCII value equals the token value for that word (Line 160). This method for storing the vocabulary limits you to 255 unique tokens, but if you needed more, you could go to a two-byte representation for the tokens; 65,535 tokens should be adequate.
    An example: MARGARINE has a length of nine characters and a token value of 5. The vocabulary entry for this word consists of CHR$(9) (Control-I), MARGARINE and CHR$(5) (Control-E). Make sense?
    Line 1130 marks the end of the vocabulary data with an exclamation mark and a token value of 0. If the vocabulary searching part of the parser gets to the end of the vocabulary list without a match, a token of 0 is returned.
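As a sketch of this storage format (Python, purely for illustration; the real work is done by Listing 1's BASIC), each entry packs a length byte, the word itself, and a token byte, with the "!"/token-0 pair closing the list:

```python
# Sketch of the packed vocabulary format: length byte, word, token
# byte per entry; the list ends with the "!" entry and a token of 0.

def pack_vocab(entries):
    """entries: list of (word, token) pairs -> packed byte string."""
    out = bytearray()
    for word, token in entries + [("!", 0)]:
        out.append(len(word))              # length byte (1-255)
        out.extend(word.encode("ascii"))   # the word itself
        out.append(token)                  # token byte (0-255)
    return bytes(out)

def unpack_vocab(packed):
    """Inverse: walk the packed string entry by entry."""
    entries, i = [], 0
    while True:
        length = packed[i]
        word = packed[i + 1 : i + 1 + length].decode("ascii")
        token = packed[i + 1 + length]
        if token == 0:
            return entries                 # hit the "!" terminator
        entries.append((word, token))
        i += length + 2                    # skip length byte, word, token
```

Packing MARGARINE with token 5 yields a length byte of 9, the nine letters, then a token byte of 5 -- exactly the CHR$(9), MARGARINE, CHR$(5) sequence described above.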

If you were to write the
entire preprocessor and
evaluator parts of the parser
in assembly language, you'd
be talking about some
serious code.


The word quest

    Next month we'll look at the preprocessing and evaluator parts of the parser. For now, you'll have to settle for something simpler. Listing 2 is a BASIC program that simply loads the word-finder machine-language parser routine into RAM, reads the VOCAB.DAT file into string variable VOCAB$, lets you enter a word at the keyboard, and tells you if the word you entered is found in the vocabulary list. (Enter QUIT to exit.) This is a useful way to test whether there are any errors in your vocabulary file. For today, Listing 2 will serve as a framework for illustrating how the machine-language routine operates.
    Line 40 of Listing 2 DIMs the VOCAB$ variable to the actual length of the vocabulary file or thereabouts. The 40-byte records from VOCAB.DAT are read in Lines 140-190.
    Oh, yeah, I almost forgot: Boot Camp is supposed to be about 6502 assembly language. Well, look at Listing 3. This is the kernel of the parser, the routine that searches for a particular word in the vocabulary file. It produces only 79 bytes of object code. Shorter than you expected, eh? Well, it really isn't doing anything particularly fancy. Of course, if you were to write the entire preprocessor and evaluator parts of the parser in assembly language, you'd be talking about some serious code. The preprocessor could be written so as to be generally useful in any natural language program. However, the evaluation code is necessarily specific to each application.
    The machine-language routine in Listing 3 is intended to be called from BASIC by means of the USR function. It is relocatable, so it can be loaded at any address you like. Assemble this listing and create a disk file called PARSER.OBJ.
    Listing 2 reads the machine-language code from PARSER.OBJ. You may recall that binary (object code) files contain six bytes of header information. Lines 60-80 of Listing 2 read these six bytes and throw them away. Lines 90-120 read the actual object code and load it into a string variable called ML$. An alternative approach is to read the object code and poke it into some safe place in RAM. Since the code is relocatable, it can go anywhere, so long as you know the starting address.
    The assembly routine uses six RAM locations for specific purposes. Zero-page bytes $CB-$CE (decimal 203-206) are needed for the indirect indexed addressing mode, as we have seen so many times before. Location $6FE (decimal 1790) contains the length of the word being sought in the vocabulary; this serves as input into the machine-language routine. Location $6FF (1791) contains the word's token value (or a zero if not found); this is the parser's output. You can change these if they conflict with other uses for those addresses in your own programs.

It's not a bad idea to do
some error checking to
make sure that the
accumulator contains a "2"
after the PLA operation.

    To call this machine-language routine, first poke the length of the target word (in variable WORD$ in Listing 2) into location 1790 (Line 240 of Listing 2). Then call the machine-language routine with the USR function as shown in Line 250, specifying the addresses of the ML routine itself, the vocabulary string variable, and the target-word string variable. Faster than a speeding bullet, the token value for the word appears at Address 1791, unless the contents of WORD$ aren't found in VOCAB$, in which case a zero appears in Address 1791. Pretty simple, eh?
    Let's look at the assembly code a little more closely. Lines 450-530 in Listing 3 illustrate how to handle arguments passed from BASIC to machine language. These, you may recall, are passed via the stack, in two-byte chunks. The PLA in Line 450 removes a one-byte counter of how many arguments were actually passed. It's not a bad idea to do some error checking to make sure that the accumulator contains a "2" after the PLA operation; otherwise, a computer lockup is likely. Next on the stack are the high byte and low byte of Argument 1 (address of VOCAB$), followed by the high byte and low byte of Argument 2 (address of WORD$).
    The searching algorithm is really very simple. It begins by comparing the length of the target word to the length of the current vocabulary entry being pointed to by VOCAB (Lines 680-700, with a branch down to Line 790). If the lengths are different, the words obviously don't match, so control passes from Line 810 to label NEXTWORD at Line 1090. Lines 1100-1320 simply change the contents of VOCAB to point to the next word in VOCAB$, by skipping ahead a number of bytes equal to the length of the current word plus 2 (one for the length, one for the token).
    If the target length did match the current vocabulary entry length, a character-by-character comparison is done in Lines 820-960. This comparison actually starts with the last character in the word and works backwards. If all characters match (Ta-da!), we have a hit. Lines 970-1010 fetch the token value at the end of the current vocabulary entry, store it at address RESULT ($6FF, 1791), and return to the BASIC program. If the entire vocabulary list is searched with no match, RESULT contains a zero (Line 710).
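For readers who want the control flow without the mnemonics, here is the same search transcribed into Python -- a sketch under one assumption: that a zero byte follows the final "!" entry, so the length-byte test at label BEGIN can end the loop. (The character comparison also runs forward here, where the assembly works backward from the last character; the outcome is the same.)

```python
# Sketch of the Listing 3 search: walk the packed vocabulary
# (length byte, word, token byte per entry) looking for the target.

def find_token(vocab, word):
    """vocab: packed bytes, assumed to end in a zero length byte;
    word: the target, already uppercased. Returns the token, or 0."""
    i = 0
    while True:
        length = vocab[i]                  # BEGIN: entry's length byte
        if length == 0:
            return 0                       # end of list: not found
        if length == len(word):            # ANALYZE: lengths must match
            # CYCLE: compare the characters themselves
            if vocab[i + 1 : i + 1 + length] == word.encode("ascii"):
                return vocab[i + 1 + length]   # token byte after the word
        i += length + 2                    # NEXTWORD: skip to next entry
```

Comparing lengths first is the cheap test that lets most entries be rejected without looking at a single character.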

Conclusion

    So there you have it. A very simple assembly-language program for searching an arbitrary list of vocabulary entries to see if a target character string can be identified. Next time, we'll see a way to package this vocabulary searching part of the parser with preprocessor and evaluator routines in BASIC to show just how smart a program has to be to make Karl's Komputerized Kitchen a reality.

Acknowledgement

    I'm indebted to Dr. Bruce Argyle of Mad Scientist Software for sharing his parser code and concepts with me. Thanks, Bruce.



Listing 1:    BASIC

IE 10 REM Program name: VOCAB.BAS
AT 20 REM Utility program to build
FL 30 REM Vocabulary list file for parser
ZM 40 REM demo program in "Boot Camp"
BC 50 REM
DD 60 REM by Karl E. Wiegers
BE 70 REM
KD 80 REM First build VOCABS string
BG 90 REM
BY 100 DIM TERM$(20),VOCAB$(2000),TEMP$(41)
YU 110 A=1:PRINT "Building vocabulary..."
ZS 120 READ TERM$,TOKEN
AS 130 INLEN=LEN(TERM$)
IK 140 VOCAB$(A,A)=CHR$(INLEN)
MO 150 VOCAB$(A+1,A+INLEN)=TERM$
OD 160 VOCAB$(A+1+INLEN,A+1+INLEN)=CHR$(TOKEN)
AZ 170 IF TERM$="!" THEN GOTO 200
YG 180 A=A+INLEN+2
MU 190 GOTO 120
DU 200 REM Then print data to file
PB 210 PRINT "Saving in file...."
BV 220 OPEN #2,8,6,"D:VOCAB.DAT"
WM 230 MAX=INT(LEN(VOCAB$)/41)+1
UQ 240 FOR I=1 TO MAX-1
LV 250 TEMP$=VOCAB$(40*I-39,40*I)
WY 260 PRINT #2;TEMP$
GG 270 NEXT I
XG 280 TEMP$=VOCAB$(40*MAX-39,LEN(VOCAB$))
XE 290 PRINT #2;TEMP$
LL 300 CLOSE #2
XM 310 PRINT "All done!"
NW 320 END
EC 1000 DATA FLOUR,1,SUGAR,2,OIL,3,MILK,4
AP 1010 DATA BUTTER,5,MARGARINE,5,SHORTENING,5
AF 1020 DATA NUTS,6,WATER,7,COCOA,8
WU 1030 DATA EGG,10,EGGS,10
ZC 1040 DATA MIX,21,STIR,21,FOLD,21,ADD,21
SB 1050 DATA WHIP,22,BEAT,22
JR 1060 DATA PREHEAT,23,COOK,24,BAKE,24
AI 1070 DATA HOURS,31,MINUTES,32,DEGREES,32
IV 1080 DATA CUP,41,CUPS,41,C,41
BO 1090 DATA TEASPOON,42,TEASPOONS,42,TSP,42,TSPS,42
ZX 1100 DATA TABLESPOON,43,TABLESPOONS,43
UK 1110 DATA TBSP,43,TBSPS,43
PM 1120 DATA OUNCE,44,OUNCES,44,OZ,44,OZS,44
VF 1130 DATA !,0


Listing 2:    BASIC

NN 10 REM Sample program to demonstrate
UD 20 REM vocabulary searching for words
XD 30 REM entered by user
BK 35 REM
AS 40 DIM VOCAB$(280),TEMP$(40),WORD$(20),ML$(79)
NN 50 OPEN #2,4,0,"D:PARSER.OBJ"
FP 60 FOR I=1 TO 6
EF 70 GET #2,A
IW 80 NEXT I
PJ 90 FOR I=1 TO 79
CB 100 GET #2,A
AR 110 ML$(I)=CHR$(A)
FV 120 NEXT I
LP 130 CLOSE #2
ZY 140 OPEN #2,4,0,"D:VOCAB.DAT"
OC 150 FOR I=0 TO 6
RE 160 INPUT #2,TEMP$
YJ 170 VOCAB$(I*40+1)=TEMP$
GH 180 NEXT I
MB 190 CLOSE #2
TZ 200 PRINT CHR$(125)
AO 210 PRINT "Enter a word (QUIT to exit):"
XA 220 INPUT WORD$
IM 230 IF WORD$="QUIT" THEN STOP
ZL 240 POKE 1790,LEN(WORD$)
JH 250 X=USR(ADR(ML$),ADR(VOCAB$),ADR(WORD$))
GP 260 IF PEEK(1791)=0 THEN PRINT "Sorry, I don't know ";WORD$
BI 270 IF PEEK(1791)>0 THEN PRINT "Token for ";WORD$;" is ";PEEK(1791)
TT 280 PRINT
MU 290 GOTO 210


Listing 3:    Assembly

0100     .OPT OBJ,NO LIST
0110 ;
0120 ;Vocabulary searching routine, to
0130 ;be used as part of a natural
0140 ;language parser program
0150 ;
0160 ;by Karl E. Wiegers
0170 ;
0180 ;This machine language subroutine
0190 ;is designed to be called from a
0200 ;BASIC program. It takes two
0210 ;arguments: the address of the
0220 ;Vocabulary data string, and the
0230 ;address of the variable that
0240 ;contains the word being searched
0250 ;for, like this:
0260 ;
0270 ;X=USR(loc,ADR(VOCAB$),ADR(WORD$))
0280 ;
0290 ;
0300 VOCAB = $CB
0310 WORD =  $CD
0320 LENGTH = $06FE
0330 RESULT = $06FF
0340 ;
0350 ;routine is orged at $600, but is
0360 ;relocatable
0370 ;
0380     *=  $0600
0390 ;
0400 ;-------------------------------
0410 ;set up arguments passed from
0420 ;BASIC, into page 0 variables
0430 ;-------------------------------
0440 ;
0450     PLA
0460     PLA         ;pointer to start
0470     STA VOCAB+1 ;of vocabulary
0480     PLA         ;character string
0490     STA VOCAB
0500     PLA         ;pointer to start
0510     STA WORD+1  ;of word being
0520     PLA         ;searched for in
0530     STA WORD    ;vocabulary list
0540 ;
0550 ;-------------------------------
0560 ;search routine starts here with
0570 ;next word being pointed to by
0580 ;VOCAB variable; branch to label
0590 ;ANALYZE to look for match; if
0600 ;no match, return to here; last
0610 ;'entry' in VOCAB$ has token of
0620 ;0, so store that in RESULT and
0630 ;return to BASIC program
0640 ;-------------------------------
0650 ;
0660     CLD
0670 BEGIN
0680     LDY #0
0690     LDA (VOCAB),Y
0700     BNE ANALYZE
0710     STA RESULT
0720     RTS
0730 ;
0740 ;-------------------------------
0750 ;see if length matches that of
0760 ;next word in vocabulary
0770 ;-------------------------------
0780 ;
0790 ANALYZE
0800     CMP LENGTH  ;lengths match?
0810     BNE NEXTWORD ;no, go on
0820     LDY LENGTH  ;yes,check chars
0830 ;
0840 ;-------------------------------
0850 ;compare characters in target
0860 ;word with those in current word
0870 ;in vocabulary
0880 ;-------------------------------
0890 ;
0900 CYCLE
0910     LDA (VOCAB),Y ;get next char
0920     DEY         ;and compare to
0930     CMP (WORD),Y ;target word
0940     BNE NEXTWORD ;no match,go on
0950     TYA         ;matches, check
0960     BNE CYCLE   ;next char
0970     LDY LENGTH  ;found! point to
0980     INY         ;token value, get
0990     LDA (VOCAB),Y ;it, and store
1000     STA RESULT  ;in RESULT byte
1010     RTS         ;all done, so exit
1020 ;
1030 ;-------------------------------
1040 ;skip to next word by adding
1050 ;length of current word to
1060 ;pointer to vocabulary list
1070 ;-------------------------------
1080 ;
1090 NEXTWORD
1100     CLC
1110     LDY #0
1120     LDA VOCAB
1130     ADC (VOCAB),Y
1140     STA VOCAB
1150     BCC NOINC1
1160     INC VOCAB+1
1170 NOINC1
1180     CLC
1190     LDA VOCAB
1200     ADC #2
1210     STA VOCAB
1220     BCC NOINC2
1230     INC VOCAB+1
1240 NOINC2
1250 ;
1260 ;-------------------------------
1270 ;continue search with next word
1280 ;in the vocabulary
1290 ;-------------------------------
1300 ;
1310     CLC
1320     BCC BEGIN