Classic Computer Magazine Archive CREATIVE COMPUTING VOL. 9, NO. 11 / NOVEMBER 1983 / PAGE 240

The joy of Lex; a program for word lovers. (Lexicon computer program) Thomas M. Paikeday.

If you think the joys of life end with sex, food, and video games, I have news for you. Some people revel in words and even have orgies.

One of the many occupations that microelectronics has revolutionized is the compilation of reference works which require the collection and organization of vast amounts of data gathered from innumerable sources. Traditionally, the editors of catalogs, indexes, directories, thesauruses, encyclopedias, and dictionaries have relied on the card file as their main repository of information.

As is well known, dictionaries are the most widely used of reference books, second only to the Bible in total world sales. They are also the most specialized and complicated of all, requiring data files that consist of millions of 4" x 6" slips called citations illustrating the meanings and usages of words. The citation file of a traditional dictionary publisher such as Oxford University Press or the Merriam-Webster Company in Springfield, MA, is as large as the card catalogs of the largest libraries in the world.

Until a few years ago, computerizing a data file of such a size was prohibitive even for the most prosperous of book publishers. The microelectronic revolution, however, has made it affordable for even the private lexicographer.

One of the latest dictionaries on the American market today is The New York Times Everyday Dictionary, published in the fall of 1982 by Times Books. In the late 1970's when the Times dictionary was commissioned, microcomputers were just making their debut. By the time the dictionary was ready for publication, a complete microcomputer system for the collection and analysis of dictionary data was in service--a world's first for a publication associated with one of the world's greatest newspapers.

This article outlines the various functions of that microcomputer system, the Lexicon program, as it has been called. It should prove of interest not only to lexicographers and linguists but to a wide spectrum of authors, editors, indexers, and word lovers--practically everyone concerned with any type of English composition or language analysis.

Except for the file creation routine, which is written in Basic, Lexicon is in machine code, in two separate modules, each taking up less than 5K of memory. Most routines are executed by the computer at an average reading speed of about 25,000 words a minute. That figure is for a computer with only 48K of internal memory working at about 2MHz clock speed, like the first machine we used on our project, a TRS-80 Model I. With larger and faster machines using hard disks, the speed is considerably higher.

Preparing The Bed

A word processing program such as Scripsit working with 48K of memory can handle online files of only up to 5000 words--too small for a dictionary data file, or even for a book of some size such as Joy of Cooking or The Thorn Birds (250,000 words). Lexicon is designed for book-length files, in fact, for files of indefinite length.

The first routine of Lexicon creates a continuous "mainfile" on the disk drives connected to the computer. The mainfile is continuous in the sense that when a global search is commanded, say for a word count of the entire online file, the computer starts with Drive 0 and reads to the end of the last drive connected to it. An uninterrupted search of one megabyte of data (about 200,000 words) takes from eight seconds to eight minutes, depending upon whether the information sought is at the beginning or end of the mainfile.
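The eight-minute upper bound follows from the figures already quoted: 200,000 words read at about 25,000 words a minute. A minimal sketch of that arithmetic is given here, assuming a single pass at constant speed; the function name is illustrative and not part of Lexicon.

```python
# Figures quoted in the article; a constant-speed, single-pass scan is an assumption.
WORDS_PER_MINUTE = 25_000   # average reading speed of most routines
MAINFILE_WORDS = 200_000    # roughly one megabyte of text


def scan_minutes(words_to_read, words_per_minute=WORDS_PER_MINUTE):
    """Minutes needed to read `words_to_read` words at a constant rate."""
    return words_to_read / words_per_minute


# Worst case: the item sought lies at the very end of the mainfile.
print(f"Full scan: {scan_minutes(MAINFILE_WORDS):.0f} minutes")
```

By the same rate, an eight-second search would have read only a few thousand words, which fits a hit near the beginning of the file.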

The creation of a mainfile itself takes about ten minutes to go through four 40-track double density drives--the floppy system we found most practical for everyday use. Once a set of disks is thus prepared, they can be used as masters to copy from relatively quickly for making more mainfiles for storage and analysis of different texts.

Texts are typed in using a word processing program or received from other databases such as CompuServe and online newspapers using a terminal program included in Lexicon.

Lex With Ancients and Moderns

When we work on books that have already been typeset, instead of keying in the whole text, we buy the magnetic tapes used for their composition. The major works of world literature, ancient and modern, in languages from Chinese to Greek and Latin to Tibetan and Turkish, are available in machine readable form to scholars at nominal cost from the computing centers of universities such as Oxford and Cambridge. All such tapes can be put on the tape drive of a mainframe computer and converted for use on your micro through a hardwired connection to the RS-232 interface.

Since text entered in a mainfile cannot be changed after loading, all editing is done before input. Again the word processing program comes in handy for routine editing, but only Lexicon can be used for inserting bibliographies, footnotes, etc. in the texts going into a mainfile.

Insertions are made by calling up the desired file on the screen from its disk storage, typing in the new material at the appropriate places in the text, and entering it after inserting boundary markers before and after it. Insertions must not be longer than one line each, or 64 characters and spaces. Such inserted lines are displayed by the computer at the bottom of the screen without being read as part of the online text (Figure 1).

Once your texts have been edited, they may be loaded from the buffer into mainfile one after another with a single keystroke.

The Lex Drive Or Seven On The Floor

Using the first of seven retrieval functions of Lexicon, any string of characters and spaces to a maximum length of 64 (including the command word FIND) may be searched for in the mainfile (Figure 2). Variations in the spacing used before and after the input string yield four possible variations on what is retrieved. In the following commands # stands for space:

1. FIND # # (string) # (ENTER)

2. FIND # # (string) (ENTER)

3. FIND # (string) # (ENTER)

4. FIND # (string) (ENTER)

This is designed chiefly to get out dictionary data based on word formations. Thus, if you wanted form, formal, formed, former, formula, and form and substance, you would key in "Form" using the second command above. If you wanted inform, perform, transform, and uniform, you would key in "Form" using the third command.
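The four spacings can be sketched in modern terms as follows. This is one reading of the commands, not the program's documented behavior: it assumes the first space merely separates FIND from its argument, that a second leading space pins the string to the start of a word, and that a trailing space pins it to the end of a word. That reading agrees with the two examples just given. The function names are hypothetical.

```python
import re


def find_pattern(command_tail):
    """Build a regex from the text typed after the word FIND.

    Assumed reading: the first space is only a separator; a further
    leading space anchors the string to a word start, and a trailing
    space anchors it to a word end.
    """
    arg = command_tail[1:] if command_tail.startswith(" ") else command_tail
    starts_word = arg.startswith(" ")
    ends_word = arg.endswith(" ")
    core = re.escape(arg.strip())
    left = r"\b" if starts_word else ""
    right = r"\b" if ends_word else ""
    return re.compile(left + core + right, re.IGNORECASE)


def find_words(text, command_tail):
    """Return the words of `text` that satisfy the FIND command, in order."""
    pattern = find_pattern(command_tail)
    return [w for w in re.findall(r"[A-Za-z]+", text) if pattern.search(w)]


sample = "They inform us that the formal uniform was formed to reform a form."
print(find_words(sample, "  Form"))   # second command: string begins a word
print(find_words(sample, " Form "))   # third command: string ends a word
```

Under the same assumption, the first command would retrieve only the free-standing word and the fourth would retrieve the string anywhere inside a word.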

But the chief purpose of the FIND function in dictionary-making is to have large data banks of English texts belonging to various genres online to tap into for evidence of new words entering the vocabulary. The traditional method using a card file has been shown to be erratic in its results. In the just-published ninth edition of the best-selling Merriam-Webster Collegiate Dictionary, for example, which is supposed to reflect the new vocabulary of the decade since the eighth was published in 1973, relatively current terms such as computerist, bargaining chip, baby boom, fast lane, spreadsheet, X-rated, boardsailing, and checkbook journalism are not entered, while such rare and obsolescent ones as computernik, downsize, and white flight are.

In common use, the FIND function of Lexicon comes in handy for locating related words, such as Reagan, Reagan's, and Reaganomics, in the text that is being analyzed. Since the stress of the Lexicon program is on the lexical or meaningful aspect of a word, the FIND routine is geared to finding words and phrases irrespective of spelling variants involving capital and lowercase letters, as in XEROX/xerox, MacDonald/Macdonald, and However/however. It also disregards punctuation marks and symbols, which the computer simply reads as spaces.
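The matching rules described here (case ignored, punctuation read as spaces) amount to normalizing both the text and the search string before comparing them. The sketch below is one way to do that. Collapsing runs of spaces into one is an assumption, as are the function names, and the pattern handles only unaccented Latin letters and digits.

```python
import re


def normalize(text):
    """Lowercase the text and read every non-alphanumeric run as one space."""
    lowered = text.lower()
    spaced = re.sub(r"[^a-z0-9]+", " ", lowered)
    return spaced.strip()


def find_all(text, query):
    """Return the start offsets of `query` within the normalized text."""
    haystack = normalize(text)
    needle = normalize(query)
    return [m.start() for m in re.finditer(re.escape(needle), haystack)]


print(normalize("XEROX, xerox; However/however."))
print(find_all("MacDonald met Macdonald.", "macdonald"))
```

Note that the offsets refer to the normalized text, not the original, so a real tool would need to keep a mapping back to the source positions.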

The found word or phrase appears flashing in the center of the screen surrounded by text with its relevant bibliography or footnote displayed in the last line. If more context than is on the screen is desired, the text may be scrolled forward or backward to either end of the file using the up/down arrow keys. A printout of what is on the screen at any time may be obtained by pressing a specified key.

To get the next occurrence of the word or phrase being searched, the ENTER key is pressed and held momentarily. This process may be repeated until the end of the online file is reached and the computer returns to the READY prompt. The BREAK key is used to interrupt the search at any time and FIND another word or phrase.

Using the PHRASE function, you can locate strings that are discontinuous and variable, to a maximum length of 32 characters and spaces, within initial and final characters that must be specified in the command. Thus, by keying in "g" followed by "together" you could search for variants of a phrase such as get together, get it together, get it all together, get himself together, get your act together, and got it together (Figure 3). Or, by keying in "la" followed by "claim" you could find occurrences of lays claim and laid no claim. Leaving no space after "claim" will net occurrences such as laying claims.
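A PHRASE search of this kind maps naturally onto a bounded regular expression. The sketch below is an approximation under two assumptions: the 32-character limit is taken as the length of the whole phrase, and the initial characters are required to begin a word. Neither detail is spelled out above, and the function name is hypothetical.

```python
import re

MAX_SPAN = 32  # the article's limit, treated here as the whole phrase length (assumption)


def find_phrases(text, initial, final, max_span=MAX_SPAN):
    """Return phrases that begin a word with `initial` and end with `final`."""
    gap = max_span - len(initial) - len(final)
    if gap < 0:
        return []
    pattern = re.compile(
        r"\b" + re.escape(initial) + r".{0,%d}?" % gap + re.escape(final),
        re.IGNORECASE | re.DOTALL,
    )
    # finditer yields non-overlapping matches, shortest from each start point.
    return [m.group(0) for m in pattern.finditer(text)]


text = "You must get your act together. They got it together. He laid no claim to it."
print(find_phrases(text, "g", "together"))
print(find_phrases(text, "la", "claim"))
```

Because the filler between the two ends is unconstrained, an earlier word beginning with the initial characters can capture the match, so results still need a human eye.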

A popular use of this function would be to locate portions of text containing two specified key words as, say, poach and fish, thus narrowing your search (Figure 4).

Your Lex Count

This function is used to get a count of the total number of times a specified string occurs in the online file in the free, prefixed, suffixed, or infixed position. Various literary and linguistic frequency counts may be made using this function, from how many times the first person singular is used in a text to how much more frequent is the spelling sequence ei than ie to whether quantum jump is more frequently used in current English than quantum leap. The COUNT commands are varied, using the same kind of spacing as the FIND routine.
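A count of this sort is easy to illustrate. The sketch below counts a string anywhere in the text (the infixed position only) and ignores case. The sample sentence and the function name are invented for illustration. The last line applies the same count to single spaces, which is how this section goes on to obtain a total word count.

```python
def count_string(text, string):
    """Count non-overlapping, case-insensitive occurrences of `string`."""
    return text.lower().count(string.lower())


sample = "Their neighbor received a piece of the field."
ei = count_string(sample, "ei")
ie = count_string(sample, "ie")
print(f"ei: {ei}, ie: {ie}")

# With exactly one space between words, spaces plus one gives the word count.
print("words:", count_string(sample, " ") + 1)
```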

However, the most popular use of this function will be to count the total number of words in a book or text. For this purpose, a space is used in the command instead of a string, and the resulting count plus one will be the total word count, since all spaces between words were reduced to one each in the course of loading into the mainfile (Figure 5).

The Lex Dance

Our concordance is a lining up of identical strings of characters in a text with a few words of context preceding and following each. It is the easiest device for taking a quick look at a booklength text from various angles based on words or for polling writers represented in a file on points of grammar and usage relating to particular words.

Are commas being used consistently in the text? Is there a distinction between bibliographic and bibliographical with regard to usage if not meaning? In what syntactical settings is different followed by from, than, or to? Is between used only to show relationships between two terms--is it always among more than two? Does the consensus favor If I were over If I was for conditions contrary to fact? All such questions can be polled by commanding a concordance of the key words or phrases involved; that is, if your file is composed of carefully selected texts (Figures 6, 7).

In ordinary use, the concordance function helps editors and indexers to line up the key words of a book or text and print out a master list which can then be used for closer study of the text using the FIND function (Figure 8).

Concordances are prepared in sets of 78 lines 128 characters wide. You must scroll right and left to read either end of the 64-character display. Pressing the CLEAR key provides a printout of a concordance after which the computer automatically proceeds to prepare another concordance from the next portion of the file. The BREAK key is used to command a new concordance with a different string.
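A concordance of this kind is what is now often called a keyword-in-context listing. The sketch below is a rough modern equivalent, not a reconstruction of Lexicon's routine: it assumes the key string is centered in each line with equal context on both sides, and it does not model the sets of 78 lines. The function name is hypothetical.

```python
import re

LINE_WIDTH = 128  # characters per concordance line, as stated in the article


def concordance(text, keyword, width=LINE_WIDTH):
    """Return one fixed-width line per occurrence, with the keyword centered."""
    flat = " ".join(text.split())        # collapse line breaks and extra spaces
    side = (width - len(keyword)) // 2   # context characters on each side
    lines = []
    for m in re.finditer(re.escape(keyword), flat, re.IGNORECASE):
        left = flat[max(0, m.start() - side):m.start()].rjust(side)
        right = flat[m.end():m.end() + side].ljust(side)
        lines.append(left + m.group(0) + right)
    return lines


text = "This is different from that. It is different than before."
for line in concordance(text, "different", width=41):
    print(line)
```

Because every line has the same width and the key string starts in the same column, the words that follow it (from, than, to) line up for quick comparison.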

As in the FIND routine, the commands may be varied in four ways to draw up a concordance of the particular string you want: in the free, prefixed, suffixed, or infixed position.

Your Lex Profile

Alphabetization and the three remaining routines proceed at about 1000 words a minute, perhaps the fastest at which so much sorting and merging can be done by a micro of less than 2MHz clock speed. If that is not fast enough, consider that your workhorse will never take even a coffee break, and the work will be done and waiting for you when you return from lunch.

Alphabetization is useful not only in vocabulary studies but also for reading proofs of books that have been typeset. To catch typos and such spelling errors, just check the alphabetical printout for unusual forms and then comb the file using the FIND routine for their exact locations.
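The proofreading use can be sketched in a few lines. This version assumes the printout lists each distinct word once, in lowercase, which the article does not specify. The function name is illustrative.

```python
import re


def alphabetical_word_list(text):
    """Return the distinct words of `text`, lowercased and sorted."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted(set(words))


proof = "The alphabetical list and the aplhabetical list"
print(alphabetical_word_list(proof))
```

In the sorted list a misspelling tends to land near its correct form, where the odd one out is easier to spot.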

Another routine that is based on alphabetization is Comparison of Vocabularies, a routine which enables you to check two files against each other and print out the differences in vocabulary.
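In modern terms a vocabulary comparison is a pair of set differences. The sketch below assumes that "differences in vocabulary" means the words found in one file but not the other. The function names are hypothetical.

```python
import re


def vocabulary(text):
    """Return the set of distinct lowercase words in `text`."""
    return set(re.findall(r"[a-z]+", text.lower()))


def compare_vocabularies(text_a, text_b):
    """Return (words only in A, words only in B), each sorted alphabetically."""
    vocab_a, vocab_b = vocabulary(text_a), vocabulary(text_b)
    return sorted(vocab_a - vocab_b), sorted(vocab_b - vocab_a)


only_a, only_b = compare_vocabularies("the cat sat", "the dog sat")
print(only_a, only_b)
```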

A third routine that is very useful for students of vocabulary as well as dictionary makers is frequency-ranking. You can rank the, of, and, to, a, in, is, you, that, it, or whatever, depending on the vocabulary used in the text being studied. A ranked list gives a profile of the writer's diction and reveals many other facets of the English language that are of technical interest to the linguist or lexicographer.
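Frequency-ranking can be illustrated with a short sketch using Python's standard library. The word-splitting rule and the function name are assumptions made for the example, not details of Lexicon.

```python
import re
from collections import Counter


def frequency_ranking(text, top=10):
    """Return the `top` most frequent words as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(top)


sample = "the cat and the dog and the bird"
print(frequency_ranking(sample, top=2))
```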