Optical Character Recognition Done Right
BY DAVID PLOTKIN
Optical character recognition (OCR) is a technology that has come a long way in the last few years. Essentially, it lets you move a document from paper into a computer without having to type. The document is first scanned to create an image. Unfortunately, this image can't be edited using a word processor, because the file is just random pixels to the computer. Enter the OCR. It attempts to recognize the individual letters in the image, so that the file can be saved as ASCII text. OCR programs work via pattern recognition and require large amounts of RAM and expensive software.
Optical character reader
512K, Geniscan Hand Scanner
OCR software for the ST
Datel's Readpic offers true optical character recognition (OCR). You first scan a character; Readpic then translates it into ASCII text, making it readable in any ST word processor or desktop publishing program.
Readpic To The Rescue?
Readpic, from Datel Computers, uses a very clever algorithm to implement OCR on the ST for any compatible image file. Compatible image files include DEGAS screen files (32K), STAD (a European format, seen in Megamax's Sketch) and IMG files. The files can come from anywhere, but Readpic lets you scan directly in any of these formats using Datel's Geniscan ST. Having the scanning function built in is very handy.
Readpic has two basic modes: Recognize and Learn. To use Recognize mode, you must first load a file, then load a font. Activate Recognize mode to tell the program to translate the image file into text. This can take a while, and there is no way to interrupt the process if you notice that the recognition is not very good. When the image has been translated, you can save it as ASCII text. Any characters that were unrecognized are represented in the file as tildes (~) and will have to be added manually with a text editor.
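Because unrecognized characters are saved as tildes, they are easy to hunt down after the fact. The short Python sketch below is not part of Readpic; it simply illustrates how you might list every tilde in a saved ASCII file before cleaning it up in a text editor. The function name and the sample text are invented for illustration, and a tilde that genuinely appeared in the original document would be flagged too.

```python
UNRECOGNIZED = "~"  # the marker the article says Readpic writes for unknown characters


def find_unrecognized(text):
    """Return 1-based (line, column) pairs for every unrecognized-character marker."""
    positions = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch == UNRECOGNIZED:
                positions.append((line_no, col))
    return positions


# Invented sample standing in for a saved Readpic text file.
sample = "Opt~cal character\nrecogn~tion on the S~"
for line_no, col in find_unrecognized(sample):
    print(f"line {line_no}, column {col}: needs a manual fix")
```

Counting the markers this way also gives a rough recognition rate for a page, which can help you decide whether to fix the text by hand or rescan.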
You can also move through the file using Readpic, filling in the characters yourself as you go, before saving the file. A series of onscreen buttons lets you move left, right, up and down through the file. If you find a character that has been incorrectly defined, select Redefine Symbol to correct the problem. The "special" button gives you access to the entire character set (including characters not available from the keyboard). With this function you can select the proper character with the mouse. Quick Search will find the next unrecognized character.
How well the program recognizes characters depends on many things, including how carefully the scan was made (it should be straight horizontal or vertical), how well formed the letters are and how much random "noise" there is along the edge of the letters. It also depends on how closely the font you are using matches the one in the scanned image. For an image file based on the correct font, the recognition rate can often be 100 percent.
If the recognition rate is poor, even though the scan quality is good, the problem is likely to be that the font in the scan doesn't match the loaded font very closely. To solve this problem, Readpic's Learn mode comes into play. In Learn mode, you step through the text, "teaching" Readpic what each character is. In doing this, you are defining the font so that future scans based on this same font will be more successful. You can define two or more different patterns to be the same letter, to take into account, for example, bold and italics. Once completed, the font can be saved to disk.
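One way to picture a learned font is as a lookup table in which several patterns can point to the same letter. The Python sketch below is only an illustration of that idea. The Font class, the placeholder pattern names and the use of JSON for saving are all assumptions; the article does not describe Readpic's actual font file format.

```python
import json

UNRECOGNIZED = "~"  # marker used for characters the font does not know


class Font:
    """Hypothetical stand-in for a learned font: maps patterns to characters."""

    def __init__(self, patterns=None):
        self.patterns = dict(patterns or {})

    def learn(self, pattern, char):
        # Several different patterns may be taught as the same character.
        self.patterns[pattern] = char

    def recognize(self, pattern):
        return self.patterns.get(pattern, UNRECOGNIZED)

    def dumps(self):
        # JSON is an assumption chosen for this sketch, not Readpic's format.
        return json.dumps(self.patterns, sort_keys=True)

    @classmethod
    def loads(cls, data):
        return cls(json.loads(data))


font = Font()
font.learn("a/plain", "a")   # placeholder pattern names, invented for illustration
font.learn("a/bold", "a")
font.learn("a/italic", "a")

restored = Font.loads(font.dumps())   # round trip, as if saved and reloaded
print(restored.recognize("a/bold"))
print(restored.recognize("q/plain"))
```

The point of the design is that teaching a new variant never disturbs the ones already learned, so a font can grow as you meet bold or italic text.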
There are several parameters that you can adjust in Readpic to increase the recognition rate. Readpic works on a clever but simple principle. Three lines are defined across the bottom and five lines up the side of a character. Basically, the character is recognized by the number of times that it crosses each line. Clearly, if a line's location is near an edge with random noise in the digitization, you'll get inconsistent results. You can also set spacing horizontally and vertically so that the program consistently finds the letters. An editor lets you clean up random pixels in the scan.
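The line-crossing idea can be sketched in a few lines of Python. This is one reading of the description above, not Datel's actual code: it assumes three vertical and five horizontal scan lines spaced evenly across the character cell, and it counts how many separate strokes each line passes through. The tiny 7-by-7 bitmaps and all function names are invented for illustration.

```python
def parse(rows):
    """Turn rows of '#' (ink) and '.' (blank) into a grid of 1s and 0s."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def count_crossings(pixels):
    """Count the separate runs of ink met along one scan line."""
    crossings, previous = 0, 0
    for p in pixels:
        if p and not previous:
            crossings += 1
        previous = p
    return crossings


def line_positions(size, n):
    """Pick n evenly spaced positions inside a cell of the given size (assumed spacing)."""
    return [(i + 1) * size // (n + 1) for i in range(n)]


def signature(glyph, n_vertical=3, n_horizontal=5):
    """Crossing counts for the vertical lines, then for the horizontal lines."""
    height, width = len(glyph), len(glyph[0])
    cols = line_positions(width, n_vertical)
    rows = line_positions(height, n_horizontal)
    vertical = tuple(count_crossings([row[c] for row in glyph]) for c in cols)
    horizontal = tuple(count_crossings(glyph[r]) for r in rows)
    return vertical + horizontal


LETTER_O = parse([".#####.", "#.....#", "#.....#", "#.....#",
                  "#.....#", "#.....#", ".#####."])
LETTER_I = parse(["#######", "...#...", "...#...", "...#...",
                  "...#...", "...#...", "#######"])

# A font here is just a table from signature to letter.
font = {signature(LETTER_O): "O", signature(LETTER_I): "I"}

# One stray pixel sitting on a scan line is enough to change the counts.
noisy_o = [row[:] for row in LETTER_O]
noisy_o[3][3] = 1

print(font.get(signature(LETTER_O), "~"))
print(font.get(signature(noisy_o), "~"))
```

In this sketch the clean letter is found in the table while the noisy copy is not, which is the inconsistency that cleaning up stray pixels in the scan is meant to cure.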
The Readpic manual, though translated from German, is good, and suffers from little of the confusing phrasing common to such efforts. The program does refuse to run with Double Click's DC Deskey. Readpic seems to work and could be quite a time saver for people with lots of text to get into their computer. With a little time and effort, you can soon convert digitized images to text easily.