Programs that understand language; how they do it - syntax-directed methods

Programs that understand language; how they do it - syntax-directed methods - part 2. (tutorial) William Wright.

Last month, in Part 1, we began to discuss programming methods for understanding or "parsing" artificial language. This month, we continue with an investigation of top-down vs. bottom-up parsers and conclude with a discussion of recursive subroutines.

Top-Down vs. Bottom-Up

The syntax of any valid sentence can be represented with a tree structure. "The boy ran down the road" can be diagrammed as in Figure 1.

The diagram is called a derivation tree. Each name on the tree is called a node. Each set of nodes is a production of the node from which it descends. In the diagram above, VERB PREP-PHRASE is a production of VERB-PHRASE. Obviously a language contains enough productions to build many different derivation trees (sentences), and it is unlikely that any particular derivation will contain all of the productions of a language.

The automatons that we discussed in Part 1 are called top-down because they deduce the top of a tree first and the bottom of a tree last. They don't output actual diagrams, but they "think" in a top-down fashion. Suppose that we were parsing "The boy ran down the road." A top-down machine would start at the top with SENTENCE.

We can think of SENTENCE as the first state of the machine. We hope that SENTENCE or one of the states that follows it will accept "the," at which time the machine will add nodes to the bottom of the tree:

We can think of ARTICLE as the state that accepted "the," and we can think of NOUN as the NEXT of the accepting state. We hope that NOUN or one of the states that follows it will accept "boy," and that the parse will continue in this fashion until the derivation is complete. Top-down parsing is called predictive because the NEXT of each state (each non-terminal node) predicts what should come next in the sentence. In the example, after the machine sees "the," it predicts NOUN and won't bother to check for ARTICLE or VERB.

Trees can be built from the bottom-up. In this case, machine begins at the bottom as in Figure 2.

The machine recognizes that PREP ARTICLE NOUN is a production of PREP-PHRASE, so it places a new node on top of the tree as in Figure 3.

The machine continues to build upwards this way until it reaches a single node at the top of the tree (SENTENCE). Bottom-up parsing is said to be data driven because it attempts to build a derivation with whatever words it finds, rather than expecting to find a particular class of word.

The top-down strategy is faster and simpler, but the bottom-up strategy has more power and reduces backup.

The primary shortcoming of top-down parsing has been mentioned already: the parser is stalemated if it can't make a prediction about the next word. Consider these sentences which are common in programming languages: CMP VAR CMP VAR,X

In the first sentence, VAR functions as a complete operand (analogous to a noun). In the second sentence, VAR modifies X (analogous to an adjective). Each of these sentences requires a different ACTION and NEXT for VAR, but the only way for the parser to choose is to peek ahead at the word that follows VAR. Will it be a comma or a carriage return?

In other words, the parser must use the bottom-up data-driven approach for a moment. Switching from top-down to bottom-up and back again is an accepted parsing technique when the application justifies the extra programming. Human beings probably read in this fashion. They expect certain standard constructions (top-down) and take a second look (bottom-up) only when their expectations aren't satisfied. Most artificial intelligence programs operate on this principle. When incoming data don't match the expectations of the program, the program looks for an alternate explanation that does fit.

Of course, a top-down parser can make an arbitrary choice without looking ahead, so long as it is prepared to back up if the choice proves incorrect. Sometimes backing up is unavoidable, but it is an expensive strategy. The automation must save a copy of its variables whenever it faces an ambiguous word in case it needs to back up and test the other options. Another method of backup is to start a new set of variables (in addition to the original set) and to keep both sets up-to-date until a firm decision is possible. With this strategy, each ACTION must be prepared to start up and abandon extra variable sets also.

Backup becomes especially important when a language allows conversational variations. Conversational usually means that an error or ambiguity will be resolved by other words in the sentence. When the automation finally reaches the words that resolve the problem, it will need to back up and do some reinterpretation.

Since bottom-up parsing checks the input against every word class in the language, rather than only those it expects, the bottom-up approach automatically distinguishes between spelling and syntax errors. A top-down parser can do this also, but only with extra effort.

Bottom-up Automatons

A bottom-up automaton consists of:

* A table of all the productions in the language.

* A model of the sentence.

* A loop that compares the model against the table. The initial model is the sentence itself: The boy ran down the road.

The loop compares the model against the table and recognizes the following productions: ARTICLE [right arrow] the VERB [right arrow] ran PREP [right arrow] down NOUN [right arrow] boy NOUN [right arrow] road

So the loop rewrites the model by substituting: ARTICLE NOUN VERB PREP ARTICLE NOUN

The loop compares the updated model against the table and finds another matching production: PREP-PHRASE [right arrow] PREP ARTICLE NOUN

So the loop substitutes again: ARTICLE NOUN VERB PREP-PHRASE

This procedure continues until the model has been consolidated into the top node of the derivation (SENTENCE). Presumably, the loop will call the proper action subroutine each time it applies a production to the model.

Finite vs. Stack Machines

The model introduces a new mechanism: dynamic storage of information. The model is dynamic in the sense that its size and contents change in response to the needs of the moment. Automatons are called stack machines if they maintain dynamic information of any kind. Otherwise they are called finite machines. While it may be a misnomer, finite is intended to suggest static information that occupies a finite section of memory. Stack refers to the stack upon which dynamic information is stored.

For those who are unfamiliar with stacks, consider the analogy of a stack of plates. You can add or remove plates to or from the top of the stack whenever you wish. If you remove a plate, you get the one that was added most recently. Adding is called "pushing onto the stack" (suggests the spring-loaded stack of plates in a cafeteria), and removing is called "popping from the stack." In a computer, a region of memory is named "the stack," and a variable is initialized to point at the bottom boundary of the stack. Pushing is simulated by moving the pointer one location and then storing the datum where the pointer is pointing. Popping is the reverse operation. Thus, a stack is dynamic storage. It can be nearly full one moment and nearly empty the next.

Figure 4a illustrates how a bottom-up parser uses the stack to build and manipulate a model of "The boy ran down the road." The stack begins empty. The machine pushes nodes onto the stack for each of the words in the sentence and then consolidates the nodes into successively higher productions. Since the machine moves through the sentence from left to right, the rightmost node will always be on top of the stack. The stack pictures in Figure 4a are a movie of the progress of the machine up the derivation tree of the sentence. Most machines use small integers to represent the nodes. We will tank about Figure 4b in a moment.

Parsers can store other information on the stack besides a model of the sentence. We have hinted at the backup problem already. The parser must keep extra copies of its variables until an ambiguity has been resolved. A stack is the perfect storage depot for this dynamic information. As each ambiguity is resolved, the parser can "forget" about it by popping it off the stack.

The stack can be used to delay action subroutines when the meaning of a word is unclear (e.g., VAR in our earlier example). Whereas a finite machine calls each ACTION immediately after a certain word is recognized, a stack machine can keep a list of pending actions on the stack until the full meaning of an ambiguous word is known.

A parser needs a stack to handle recursive constructions like the parentheses in arithmetic expressions. No matter how many levels of parentheses have occurred in the sentence already, the expression can be enclosed in one more set of parentheses. Each new level forces the parser to suspend its current operation for a moment, so it needs a place to store information about the suspended levels until it has finished with the lower ones.

Finite machines live for the moment. They have no queueing ability. Stack machines can have queues, which adds another dimension to the expressive power of a language.

How Stack Machines Work

As Figure 4a illustrates, a bottom-up parser uses the stack to show what part of the derivation has been recognized already. Initially the stack is empty because no nodes have been recognized. As the machine examines each word from the sentence, it asks itself: "Can this word be added to the nodes already on the stack--as part of a production from the language?" If not, the sentence contains a syntax error. If so, the machine pushes the node of the word onto the stack. If this latest node happens to complete a production, the machine pops the entire production and pushes the appropriate higher node in its place.

In Figure 4a, the final NOUN (road) completed a PREP-PHRASE. Therefore the machine popped the entire phrase and pushed PREP-PHRASE in its place. Because PREP-PHRASE completed a VERB-PHRASE, the machine did another pop and push, and so on. The parse is complete when the machine has consolidated the contents of the stack into the top node of the tree (SENTENCE). Hopefully, the sentence will be exhausted at the same moment.

In top-down parsing, the procedure is reversed. Figure 4b shows a prototypical top-down parse of "The boy ran down the road." This time, the stack shows what is needed to complete a valid derivation. The stack is initialized with the top node of the tree (SENTENCE) because initially an entire sentence is needed to complete the derivation. As the machine examines each word from the sentence, it asks itself: "Is this word part of the derivation fragment on the stack?" If not, the sentence contains a syntax error. If so, the machine updates the stack to show which node will be needed after this one for a complete derivation. In Figure 4b, the first article (the) caused the machine to pop SENTENCE and push ARTICLE NOUN VERB-PHRASE in its place. Then "the" and "boy" caused the machine to pop ARTICLE and NOUN, leaving only VERB-PHRASE on the stack. "Ran" causes the machine to pop VERB-PHRASE and to push VERB PREP-PHRASE, and so on. The parse is complete when the stack is empty (when nothing else is needed to complete the derivation). Hopefully, the sentence will be exhausted at the same moment that the stack is emptied.

A bottom-up machine is synthetic. It pushes a node for each word onto the stack and combines them into higher nodes until it achieves a single SENTENCE node. A top-down machine is analytic. It splits upper-level nodes into lower ones and then pops them off the stack as it matches them against words from the sentence. In both cases, the stack is available for other purposes such as pending actions, recursive productions, backup, etc. As we said earlier, the configurations through which the stack passes during a parse are a movie of the trip the machine makes up or down the derivation tree of the sentence.

To direct the continual updating of the stack, the machine has a control table. Each row in the table represents one of the stack symbols (nodes), and each column represents one of the input symbols (word classes). Each intersection of row and column contains the name of a subroutine that will perform the appropriate operation on the contents of the stack. A representation of the control table appears as Figure 5.

The parsing loop consists of:

* Identifying the input symbol for the current word (by applying spelling rules).

* Looking up the intersection of the input symbol with the stack symbol that is currently on top of the stack.

* Executing the op-routine whose name is stored at the intersection.

Some op-routines are error routines that represent illegal intersections of input symbol and top stack symbol. At least one op-routine must be a "parse-is-complete" routine that causes the parser to exit.

For the rest of this article, stack symbols will be capitalized and enclosed in brackets: [NOUN]. Input symbols will be in lowercase: [noun]. We will use a special stack symbol called [EMPTY] to indicate the bottom of the stack.

Top-down Stack Machine

Figure 6a is the control table for a top-down parse of "Tom saw the dog." Admittedly, this table will parse only a few simple sentences from natural English. Figure 6b is the stack movie. You should read each row of the movie this way: "WORD is an INPUT SYMBOL. The intersection of INPUT SYMBOL and TOP STACK SYMBOL is OP. After OP has been executed, the stack will contain STACK CONTENTS."

Applying this to the first row of the movie, we would read as follows: "'Tom' is a [noun]. The intersection of [noun] and [SENT] is OP1. After OP1 has been executed, the stack will contain [NOUN] [VERB] [NOUN-PHRASE] [EMPTY]."

The movie shows that both OP1 and OP5 are required to process "Tom." OP1 splits [SENTENCE] into a production beginning with [NOUN], and then OP5 pops [NOUN] off the stack. Stack machines often require more than one operation to process a single word.

Notice that the intersection of [EMPTY] and [period] causes the parser to exit. If [period] doesn't arrive exactly when the machine expects it, a syntax error will result.

Somehow the machine must generate calls to action subroutines. Each op-routine can call an ACTION itself, or the op-routine can bury an action symbol somewhere on the stack. When the action symbol rises to the top of the stack, the machine will call the corresponding action subroutine. By burying the action symbol, the op-routine can delay the action until more of the sentence has been processed.

The machine can store the translation of a word (or a pointer thereto) on the stack. This allows the translation to be altered several times before it is used. Compilers usually evaluate an arithmetic expression by burying the translation of the first word of the expression on the stack and then updating the translation as each subsequent word in the expression is processed. At some moment during the parse, the stack might look like this: [ARITH-OPERATOR] [VARIABLE-NAME] [action symbol] [translation] [EMPTY]

As [ARITH-OPERATOR] and [VARIABLE-NAME] are recognized and popped off the stack, [translation] is updated. Now the stack will look like this: [action symbol] [updated translation] [EMPTY]

The machine calls [action symbol], which outputs [updated translation] and clears the stack of everything except [EMPTY]. Throughout all of this, the machine must not become confused by the various types of data on the stack (nodes, actions, translations, etc). Usually each routine knows exactly what to expect on the stack. In more complex cases, the machine might need a convention for data types, such as setting the high bit or odd vs. even.

Bottom-up Stack Machine

The control table and stack movie for a bottom-up parse of "Tom saw the dog." are shown in Figure 7. To simplify the illustration, some intersections in the control table have been left blank.

OP1 is one of those op-routines that must search the stack for a complete production after it has pushed a node. In the case of "Tom," the stack contained only [NOUN] [EMPTY] after the push. Therefore, the search of OP1 for a complete production failed. In the case of "dog," the stack contained [NOUN] [ARTICLE] after the push, so OP1 popped them both and pushed [NOUN-PHRASE] in their place. This search-and-replace operation is the method by which a bottom-up stack machine consolidates lower nodes into higher nodes.

To streamline the search-and-replace operation, the machine can have alternate stack symbols for the same node. Each symbol will represent not only itself but also the symbol immediately below it on the stack. Suppose a language includes these productions: [A] [right arrow] bc [E] [right arrow] fc

The machine can have two alternate symbols for c, such as [C1] and [C2]. The machine will push [C1] if the stack already contains [B], but it will push [C2] if the stack already contains [F]. This way, the top stack symbol will always be an encoding of everything below it, and op-routines won't need to look any deeper to know which production (if any) to use. This method increases the size of the control table (several rows for the same node), but the machine executes faster.

The search-and-replace operation allows a bottom-up parser to postpone decisions. In the case of an ambiguous word (e.g., VAR in our earlier example), the machine can push an ambiguous symbol such as [OPERAND-OR-MODIFIER]. Then the machine continues processing words until the stack contains enough information for the search-and-replace operation to select the correct production. The only limitation is that the machine can't push so many ambiguous symbols that the op-routines must become parsers themselves.

Bottom-up machines can use a similar technique to recover from syntax errors. Suppose the sentence was: "The light xyz shining." When the machine encounters the error (xyz), it pushes a special symbol, such as [UNKNOWN], and keeps going. Later, when the op-routine of [GERUND] attempts its search-and-replace operation, it will notice [UNKNOWN] on the stack. Rather than print an error message, it can ask:

DID YOU MEAN: The light "is" shining?

We have already discussed the disadvantages of backup during a parse. Most artifical languages are designed carefully to avoid backup. They are called deterministic because there is only one valid choice at each location in the sentence. Even if the language isn't totally deterministic, a bottom-up stack machine can avoid most backup by postponing decisions (pushing ambiguous nodes) and performing the search-and-replace operation when more information is available.

Recursive Machines

Another method of parsing uses subroutines that can call each other recursively. Each subroutine corresponds to the left side of a production. Suppose the language consists of four productions: [SENT] [right arrow] a[B] [SENT] [right arrow] c [B] [right arrow] a[SENT] [B] [right arrow] d This language could be parsed by the program outlined in Figure 8a. Figure 8b shows the two recursive subroutines in the program, and Figure 8c shows the events in a parse of "a a c." The machine is recursive because each subroutine can call itself via the other subroutine. Recursive parsing is relatively easy to code and debug, but it consumes quite a bit of time and memory. It is a camouflaged stack machine because it needs a stack to store the RTS (return) linkages between the subroutines as they call each other. It is a top-down machine because each subroutine expects certain specific words to appear next in the sentence.

Formal Theorems

We have discussed parsing machines from an intuitive point of view. Languages can be categorized formally according to their production rules. Each category is parsed most effectively by a different configuration of tables and routines. If you are interested in a rigorous discussion, the following reference is more readable than most: Compiler Design Theory; Lewis, Rosenkrantz, Stearns; Addison-Wesley Publishing, 1976.