A poor man's spelling verify program. Charles D. Arceneaux.
A Poor Man's Spelling Verify Program
Some people think that a spelling verify feature in word processing is an unnecessary luxury. These are people who seem instinctively to know how every word in the English language is spelled. But for thick skulls like me, a proofreading aid is essential. Unrortunately, thick skulls like me don't always have an understanding spouse--after spending all that money on a letter quality printer, what's left over had better go for groceries.
The solution? Simple. I have Extended Disk Basic on my Micromation computer. Why not write my own? My confidence in my ability to write my own is augmented by the fact that I was on a team that wrote a spelling verify feature for a mini-computer manufacturer. Based on my experience there, I had a good mental picture of what an interactive spelling verifier should look like. Here are the essentials:
The dictionary is the most important part of a spelling verifier, so don't be surprised if most of this article is on this subject.
It need not contain all the words in the English language, just the words in the user's vocabulary. As such, it must start out small and be able to grow as it is used. It must include all forms of the words in it--forbid, forbidden, forbidding, forbade, and, of course, forbids. Funk and Wagnalls also includes "forbidal' which is not in my vocabulary, so it should not be included in my dictionary.
Some dictionaries list each form separately, others contain coded instructions that allow reconstruction of each form. There are obviously advantages to each scheme. The method to be used must be worked out in the design phase of the project, and the decision should be made based on the access methods that are available.
A format that allows all forms of a word in the same entry goes like this: forbid,DenDingS;
The words must only be reconstituted if there is a match in the first part of the word. I have seen this scheme used on a system that had only one file type, that is a page-oriented serial or document file. The verification was done on an entire page of text at a time. The page was held in memory and accessed randomly while the dictionary was read through serially.
Since it skipped around the page while checking words, the program waited until the whole page had been checked before feeding back to the user the words that were wrong. I thought this program was awkward to use, but since it was one of the first verify programs on the market, I can't be too critical of it.
If a truly random file, (called a direct file on some systems and an indexed sequential file on others) is available, then the problem of designing a dictionary is simplified. With one entry for each word, you simply set each word up as a key and do a random seek. If there is a valid read on the file, then the word is good; if an invalid read is detected, the word is bad. This method is much more straightforward because the document can be scanned serially while the dictionary is accessed randomly. It does present another problem though.
This problem is that most direct files rely on fixed length keys. This means that the file length will be the maximum word length multiplied by the mumber of word entries. A typical file of 50,000 words with a maximum word size of 25 would take 1,250,000 bytes of file space, plus whatever overhead the access method imposed. In my mind, this is not practical for personal computer users.
Fortunately, there is an easy solution to this problem. That is to segment the dictionary by word size. Thus, you could have one dictionary for short words and another for longer words. You could even go all out and have 24 dictionaries, one each for words of length two through 25. And of course there are several intermediate divisions you could use.
Another breakdown of the dictionary can be by frequency of use. Some words in the English language are used significantly more often than others. For example, the word "the' is used over 200 times in this article. On the other hand, the word "marrowminded' almost didn't get used at all.
It would not be very hard to put together a list of the words with the highest probability of use. I have seen lists as large as 5000 words, but we would not have to go that far. The list in Figure 1 represents almost one third of the words I use in my writing. If this list were kept in memory and checked first, it would save one third of the disk seeks.
There are other sorts of words that either must or should be kept in memory in a spelling verifier. One of these is temporary words, that is, words that are peculiar to the document being verified. This would include proper names, such as the addressee of a letter, but it might also include abbreviations. Other choices for a word memory file are words that have already been looked up once and words that must be added to the dictionary.
Most punctuation is ignored, but there are two punctuation marks that must be dealt with in a spelling verifier. These are the apostrophe and the hyphen. The easiest to deal with is the apostrophe. One way to treat it is to squeeze it out of words. This would mean, for example, that "miners' and "miner's' could both be represented by the same dictionary entry. It would also mean that "couldn't' would be represented in the dictionary as "couldnt.' The alternative, of course, is to consider the apostrophe the same as any letter of the alphabet. Personally, I prefer this approach because it allows a more precise dictionary.
The hyphen was a bigger problem than the apostrophe because it has several meanings. One is as a minus sign. Another is to punctuate a sentence--like this. A third is to divide a word that extends over a line. A fourth is in joining words, such as "user-friendly.' Only the last two of these are important in spelling verification.
The third form, dividing a word, often has a separate character code associated with it in a word processor. You can use this feature to rejoin words that have been divided by hyphens. The fourth form, on the other hand, presents a big problem.
Most words that contain a hyphen as part of the spelling, such as our example "user-friendly,' are made up of roots that are valid words in their own right. As a result, if this form of the hyphen is ignored, many longer compound words can be eliminated from the dictionary. This does not apply to all words; one exception is "cul-de-sac.' In other words, either way there is some problem. I prefer ignoring the hyphen used this way, because it makes it easier to ignore forms one and two.
When a word is not found in the dictionary, it must be flagged as incorrect, and the user must be given several choices as to what to do. The possible choices are:
1) to correct the word in place; 2) to correct every occurrence of the word; 3) to add the word to the temporary dictionary for use on the rest of this document; 4) to add the word to the temporary dictionary for later inclusion in the permanent dictionary; 5) to add the word to the permanent dictionary immediately; or 6) to ignore the word. All of these choices eccept the last require extra coding, and thus memory space and other systems resources, to implement. For this reason, it is rare to find all of these features in the same program.
Speed And Space
This brings us to a question that has been haunting programmers and system designers for decades: What compromises should I make with speed, space and features? Since I had chosen Basic in which to write my verify program, I was faced with this problem from the very beginning. It is not that Basic is a bad language, in fact it has many advantages. Chief among its advantages is that it is widely available. Almost every personal computer comes with Basic free. Also, it has features built into it that allow for easy manipulation of string data. The disadvantages of Basic are that it is slow and it requires a great deal of overhead for string manipulation. (Of course, we never see what the overhead is.)
The program is divided into modules and presented in Figures 2 through 14.
The immediate problem was dictionary organization. The Basic I am using (Microsoft Extended Disk Basic) does not have a direct access method. On the other hand, it has more than the serial file method available. Thus, I had to write my own direct access for the alphabetized dictionary. Figure 10 shows the binary search routine used to access this file.
Additional Design Compromises
There is one important feature I have left out of this program that should be in an interactive verifier. That is the ability to make corrections in the source document file. I felt that the extra file handling and logic would make the program so slow and unwieldy as to be completely useless.
I avoid dividing words like the plague, so I have left all word divide hyphens alone. If you prefer to divide words in your typing, then you can add this feature yourself.
When an apparently incorrect word is found, you have only three choices with my program. You can press R for record in temporary dictionary, Q for stop processing, or RETURN for ignore the word and proceed. When the end of the file is reached or a Q is entered, you are asked if temporary words should be recorded permanently. This allows you to add to the dictionary as the program is being run.
Building The Dictionary
Rather than sitting down with a dictionary and typing words in, I have been letting the program build the dictionary. On one system I used, we had something like 80,000 words in our dictionary. Even so, every document we verified contained some words that were missing from the dictionary. This meant to me that if I had sat down with a book like Webster's Instant Word Guide and typed in words, I would be putting in a great deal of effort and ending up with a dictionary that still didn't match my vocabulary. So by letting the program itself update the dictionary, I am creating a dictionary that exactly matches my vocabulary.
Of course, I have had to sit and look up a great many words before pressing R for record. You could load the dictionary faster by typing into a document the words you want to load into the dictionary, and then verifying the document. For best efficiency, batch the words in groups of about 500 and modify the program to record each word without operator intervention.
Priming The Program
To prime my program I did three things. First of all, the original program had only a memory file. This file was loaded and saved on disk into a file called WORDS. To start, I created the file WORDS with my text editor to contain the words in Figure 1. After this file had grown to about 1000 words, the memory file was too unwieldy to use. At this point I added the coding for the disk file to my program and ran the program in Figure 15 to initialize it.
There is one significant fault with this verifier. It is slower than the other interactive verifiers that I have used. I could probably have gotten more performance out of the program if I had coded it in some other language. It is true that I do have more time than money, but I would rather put up with the small amount of slowness in this program than try to debug a PL/I version, for example.
An interactive spelling verify program has only limited application, primarily because you have to sit there while it is running. (This is a problem with fast interactive spelling verifiers also.)
Furthermore, there are several spelling mistakes that this kind of program will not catch. I have already started thinking about what the next refinement to this program should be. The problem statement is: without going into artificial intelligence or syntactical analysis, devise a program that will aid in the detection of homonym errors. Examples of homonyms are: die and dye; there and there; whey, way, and weigh. I'll start on it as soon as I've weeded the garden and cleaned the garage.
Table: Figure 1. This table shows the 22 most common words in my own writing and the probability of occurrence of each word. In these words we have almost one third of the words used in any of my writing. This distribution is probably similar to the distribution of words in this magazine, and in the whole body of English text.
Table: Figure 2. This is the set-up. First we define two arrays and fill them up. The first is T$, which is our translate table to make all letters lower case. This is filled in lines 130 to 150 from the DATA statements in lines 160 through 190. The other array is W$. This is filled with short words from the file WORD4.
In this code we also open the main dictionary file and retrieve the value W that is kept in its header. W is the number of the next available file entry and is used in the binary search and file update routines.
Table: Figure 3. Here we ask the name of the document and open it. Notice that there is no error checking. We can ask ourselves if we need to error check at this point. Certainly we can use all the memory space we can spare to store words and text. An error at this point is not catastrophic, so let's just skip it.
Table: Figure 4. This code helps me position myself within the document. If I am checking a letter, this code is not necessary, but when I have to check a story or article, then this code is vital to allow me to check only the parts I need to check.
Table: Figure 5. This little piece of code scans for the beginning of a word. If it does not find an alphabetic character, it loops up to line 350.
Table: Figure 6. This piece of code builds the word in the work buffer, then displays it to the user.
Table: Figure 7. Here we perform a binary search to see if the word is in our memory file. If it finds the word, the program will loop back to find another word. If it does not find the word, it remembers in K where the word should have been. This will allow us to add the word to the memory file later if we want to.
Notice also the trap in line 465. There are words longer than 15 characters in the English language, but I rarely use them. In fact, this trap has never been tripped.
Table: Figure 8. This is where we check to see if the user wants to record the word in the dictionary. There are two interesting things about this piece of code. The first is that we use the INPUT$ function instead of the INPUT command. To respond to this question, the user need only press one key. This is more efficient than having to press the response code and then the RETURN key. We do pay a price, though, in that it is easier for the user to press the wrong key. (Perhaps we should put more error checking here.)
The other interesting thing is that this code is not at a logical place in the flow of the program. This is because Basic does not have a convenient way to move lines. As I made changes in the logic of the program, this code stayed where I had originally put it.
Table: Figure 9. This is where we insert a new word into the memory file. This file originally contains the words from my dictionary that were from two to four characters long. We will store all words in this array, and if we are asked to record the new words, then we will break this array in two by word size, recording each group of words in the appropriate dictionary. (See Figures II through 14.)
Table: Figure 10. This is the binary search of the disk resident dictionary. Notice that it is almost identical to the binary search used on the memory file.
Table: Figure 11. This is the first part of the close routine. The program comes here when it detects an end-of-file on the document or when the user responds with a Q to the misspelled message. In this code the user is asked if the temporary words should be stored, and the short temporary words are written to disk.
Table: Figure 12. The second part of the close routine collects all temporary words longer than four into one place and releases the extra space.
Table: Figure 13. The third part of the close routine is probably the most complicated piece of code in the program. It mirges the long words from the memory file with the dictionary file on disk.
Table: Figure 14. Finally, we write the new dictionary file length into the header record and finish. The program beeps when it is through so that the user can go do something else during the three plus minutes it takes to reorganize the dictionary file.
Table: Figure 15. This program is run to brake a memory file image into a new memory file image of short words and a random disk file of long words for my spelling verify program.