Programs that understand languages; how they do it

Programs that understand languages; how they do it - syntax-directed methods. William Wright.

Language is an interaction between words. Arranging the same words in different ways causes different interactions:

The boy abandoned his own dreams of happiness (Despair)

The abandoned boy dreams of his own happiness (Hope)

The interactions that occur in natural languages (our conversation and literature) are too complex for today's computing techniques. When we exchange information with a program, we must resort to an artificial language with simplified grammar and semantic rules. The purpose of this article is to explain several programming methods for understanding or parsing artificial language. These methods can't cope with natural language, but they do allows us to proceed with workaday applications of language like compilers, question answering programs, and operating systems.

An Overview

Since a language allows different arrangements of the same words, it follows that a parser cannot apply the same format specification to every sentence. An input statement like INPUT X,A$,Y is not true parsing--unless we consider the input to be a degenerated one-syntax-rule language. The following code is not parsing either, because it treats the input as a single unit rather than as a string of interacting words: INPUT A$ IF A$ V "LET I = 5" THEN ... IF A$ = "LET I = 6" THEN ... IF A$ = "LET I = 7" THEN ... ... (etc) ...

Why examine words individually? For one thing, most languages have too many sentence possibilities. The dictionary of all legal sentences would quickly exceed the memory of any computer. Equally as important, constructing a dictionary of sentences would imply that we were able (and willing) to anticipate all the ideas that the user of the language would want to express. If languages were merely dictionaries of previously defined statements, we couldn't use them to express new ideas or ask new questions.

The implications don't stop here either. If we must consider the assembly of individual abstractions (words) into complete ideas (sentences), then we are considering artificial intelligence. One of the enticements of language processing is that it allows (forces?) us to experiment with the ultimate computer application: machine intelligence. Most of the parsing techniques described in this article apply to artificial intelligence just as much as they do to language.

For the remainder of this article, word will mean any character string that is an elemental unit of meaning in a language. Thus numbers (strings of digits) and arithmetic operators (such as + and -) are words in most programming languages. A sentence is any string of words that satisfies the rules of the language for sentences.

An Introductory Parsing Automaton

We use the term automaton or machine to suggest that the response of the parser to a sentence is automatic and predetermined. The automation consists of a loop and a data table. The loop successively examines each word in the sentence and uses the information in the table to spot syntax errors, to translate the words, and to integrate the translations into a complete meaning for the sentence. The table is organized into rows. Each row is called a state and represents the rules (syntactic and semantic) for a unique use of a particular class of words: STATE1: WORD1 ACTION1 NEXT1 WORD2 ACTION2 NEXT2 WORD3 ACTION3 NEXT3

ERROR ERR-MSG STATE5: WORD5 ACTION5 NEXT5 WORD6 ACTION6 NEXT6 ERROR ERR-MSG

In a job control language, STATE3 might represent a use of filenames, while STATE6 might represent some other use of filanames. STATE2 might be a use of integers, and so on.

WORD is the name (or other symbolic representation) of a subroutine that knows the spelling rules for a class of words. Typical word classes in a programming language might be: integer, floating point number, arithmetic operator, comment delimiter, variable name, or keyword. WORD examines the current word from the sentence and decides whether or not the word satisfies the spelling rules of the class. For example, the subroutine to test if the current word is a properly spelled integer might look like this: SPELL=1 FOR L = 1 TO LEN (WORD$) I$ = MID$ (WORD$, L, 1) IF I$ < "O" OR I$ > "9" THEN SPELL = 0 NEXT L RETURN

After calling this subroutine, the parser can check SPELL to see whether the spelling of the word was accepted or rejected (1=accepted, 0=rejected).

The parser begins by calling WORD1. If WORD1 doesn't accept the first word of the sentence, the parser will call WORD2, and so on. If none of these states accepts, the parser will arrive at ERROR, which is a subroutine that prints the error message named by ERR-MSG.

On the other hand, if one of the states accepts, the parser will call the ACTION for that state. For example, if WORD3 accepts, the parser calls the subroutine named by ACTION3. Action subroutines are part of the main program that called the parser originally. The parser is returning control to the main program for a few moments, so that the program can do something with the word. Depending on the application, the action might be as simple as storing the translation somewhere, or it might be a complex set of calculations. ACTION is where the semantic meaning of the word is analyzed.

After ACTION returns, the parser moves to the state named by the NEXT of the accepting state. If WORD3 accepts and NEXT3 contains the name of STATE5, the parsing loop will begin at STATE5 when it examines the next word in the sentence.

In summation, the table is a compact and convenient technique for instructing the parser: "Once you identify something in the sentence (WORD), I will tell you what to do with it (ACTION) and what to accept next from the sentence (NEXT)."

A flow chart is shown in Figure 1. We will discuss the BAD-ACTION box later. To illustrate the operation of automaton, consider there two sentences from a hypothetical programming language: GO:INC SIZE BEQ GO

In these sentences, GO and SIZE are symbolic names. INC and BEQ

are executable instructions (increment and branch-if-equal). The

table might look like this: STATE1: INSTRUCTION OUT-INSTR STATE6 STATE2: SYMBOL DEF-SYMBL STATE4 ERROR NOT-INSTR STATE4: COLON OK-COLON STATE1 ERROR NOT-COLON STATE6: SYMBOL LOOK-SYM 0 ERROR NOT-SYMBL

During a parse of the first sentence, the sequence of events would be:

STATE1 rejects GO because it isn't an INSTRUCTION

STATE2 accepts GO because it is a SYMBOL, calls DEF-SYMBL, and then moves to STATE4

STATE4 accepts: because it is a COLON, calls OK-COLON, and then moves to STATE1

STATE1 accepts INC because it is an INSTRUCTION, calls OUT-INSTR, and then moves to STATE6

STATE6 accepts SIZE because it is a SYMBOL, calls LOOK-SYM, and the exits (because NEXT = 0)

For the second sentence, the events would be:

STATE1 accepts BEQ, calls OUT-INSPR, and then moves to STATE6

STATE6 accepts GO, calls LOOK-SYM, and then exits (because NEXT = 0)

In this hypothetical language, symbolic names have two different usages, and the automation calls a different ACTION for each usage. When the symbol begins a sentence (as in GO:), the symbol defines a location in the source code, and the automation asks DEF-SYMBL to make a record of the location. When the symbol follows an instruction (as in BEQ GO), the symbol is a reference to a location defined elsewhere, and the automation asks LOOK-SYM to look it up.

Moving from one state to another is called transition, suggesting that the expectations of the parser will change each time it recognizes a word in the sentence. Initially the automaton expected either an instruction or a symbol (STATE1/STATE2). After recognizing a symbol (GO), the automaton expected only a colon (STATE4). Each NEXT predicts the word classes that might appear next in a legal sentence.

The completion of the parser is signalled by the table (NEXT=0), not by the end of the sentence. One of the duties of the parser is to decide whether the sentence contains either too few or too many words. If the final word does not appear exactly when expected, the sentence contains a syntax error. (Carriage-return is a word in most languages.)

The parser is in trouble if it reaches an ERROR. Since no state accepted the current word, the parser can't know which NEXT to use for the rest of the sentence. The parser must abort because it cannot predict what words should come next. This is an important issue to which we will return later.

We have glossed over the difference between action and translation. For example, all integers receive the same translation--character string to numerical quantity--but they cause different actions depending upon their usage. In Basic, an integer can be used as either a statement label or a number (e.g., 10 A = 10). To make the parser more compact, we can place the translation operation in WORD rather than repeating it in several different ACTIONs. Every ACTION must know where its WORD left the translation. This technique applies only when the translation of a word is independent of its usage in a sentence.

State That Call Other States

A phrase can have more than one use, just as a word can. Consider an arithmetic expression. Programming languages allow an arithmetic expression after =, before or after a relational operator, as an array subscript, etc. To parse the expression, the automation will need a sequence of states. Rather than cluttering the table with repetitions of this sequence, we can limit the table to a single instance which other states can call--as program calls a subroutine. This is how it would look: CALL SEQUENCE ACTION NEXT

CALL is a special subroutine that saves (makes a copy of) the location of the current state and substitutes SEQUENCE in its place. As a result, the sequence will direct parsing until something tells the automaton to return to the calling state (by reversing the substitution). NEXT = 0 and ERROR are the obvious candidates for causing a return. NEXT = 0 will mean a successful return (the sequence accepted), and ERROR will mean an unsuccessful return (the sequence rejected).

An illustration is shown below. Both STATE6 and STATE12 call the sequence at STATE16. If the sequence accepts (STATE18 accepts), the automaton will return and execute ACTION6/ACTION12. Then the automaton will move to STATE49/STATE92. If the sequence rejects (reaches STATE17), the automaton will return and fall through to ERROR7/ERROR13 with calling any ACTION.

Notice that ERROR 17 does not have an ERR-MSG. This is because an error has not occurred yet. We are using ERROR as a shorthand for "call failed," not for "parse failed." The section of the table that issues the call must decide whether or not an error has occurred. Another point: since the call may have processed several words before reaching ERROR, the automaton must be prepared to restore the input pointer to its original value before it returns from the call.

Preliminary Lexical Scan for Efficiency

The automation has an inefficiency: the same word will be checked for spelling by many states until one state accepts or all of them reject. It would be more efficient to check the spelling of each word only once, before the parse begins, to identify its lexical type: digit string, alphabetic string, arithmetic operator, etc. With this information, the parser could avoid calling many WORDs unnecessarily. For example, if the word is "49," it would be a waste to execute the WORDs for variable names and keywords (which must begin with letters).

Other Refinements

If a word class contains only one word (e.g., COLON contains only :), WORD can be the actual word rather than the name of a subroutine. This eliminates the overhead of a subroutine call. The automaton must represent subroutine names with symbols that can't be confused with printed characters.

Some states won't need an ACTION. For example, punctuation marks often serve as separators and don't carry any information that requires action. Rather than cluttering the table with ACTIONs that do nothing, we can have a convention (such as setting the high bit of WORD) that indicates no action.

Sometimes it is convenient to call ACTION or move to NEXT without processing any words from the sentence. For this, the automaton needs a WORD that does nothing or a convention that indicates no WORD. For example, the automaton may want to hop over an ERROR if a missing word was optional, or it might want to perform a final action after a complete phrase or sentence has been processed.

Semantics vs. Syntax

Words can satisfy the syntax of a language while violating its semantic rules. For example, 50 GOTO 1000 is good syntax, but it would be illegal semantics if no other sentence began with 1000. To handle such situations, the parser has a BAD-ACTION routine to which any ACTION can branch in case of semantic error. BAD-ACTION does whatever bookkeeping is required to simulate ERROR. The flow chart in Figure 1 shows a BAD-ACTION.

We have been discussing syntax-controlled automatons. The spelling and location of each word in the sentence determines unambiguously which ACTION should be called and which state should be used as NEXT. However, there are situations in which we would like semantics (action subroutines) to influence the operation of the automaton. For example, assmbler programming languages use the following syntax for an executable instruction: LABEL: INSTRUCTION OPERAND ;COMMENT

The difficulty is that different instructions demand different types of operand. We could solve this problem by defining different instruction types and assigning a different word class to each type. The parsing table would have a separate state (each with its own ACTION and NEXT) for each class. However, a better solution might be a single state whose WORD accepted all instructions and whose ACTION knew which NEXT to use for each instruction.

In other words, the automaton allows ACTION to overrule or replace the NEXT in the table. If nothing else, this improves the efficiency of the parser because it avoids calling several WORDs to identify a single instruction. Also, this technique concentrates the knowledge the parser has about operand types in a single action subroutine, rather than scattering the knowledge throughout a large table.

This technique admittedly violates some of the axioms upon which formal automaton theory depends, but no formal theorem can cope with all aspects of language anyway. We might say that our action subroutines are "intelligent" enough to cope with ambiguous or even contradictory syntax. One of the unsolved problems of all machine intelligence applications, including language processing, is how a program should decide when to abandon rigorous logic and "play the odds" instead. Perhaps this famous example will whet your appetite:

Time flies like an arrow.

A computer program at Harvard University found four syntaxes that match the above sentence:

* Time moves through the air in the same manner as an arrow does. (Flies is the verb)

* Measure the speed of a housefly by the same means as you would measure the speed of an arrow. (Time is the verb)

* Measure the speed of houseflies that resemble an arrow. (Like ... an adjectival phrase, not adverbial)

* A species of housefly, called a "time-fly," admires an arrow. (Like is the verb)

None of the above captures the real meaning of the sentence: Time passes as quickly as an arrow in flight. Obviously, we need something besides syntax to understand this sentence. For those interested in language as an experiment in machine intelligence, the following references contain interesting chapters and are more readable than most.