Learning With Computers: The Computer Speaks, But Will It Listen?

Glenn M. Kleiman

Computer-generated speech, already used in some software, will be incorporated into many educational programs in the next few years. Spoken instructions and responses will be used in programs designed for prereading children and for students who have reading difficulties. Speech will be an integral component of programs which help students learn reading, spelling, and foreign languages, and will make many other types of educational programs more interesting and enjoyable.

Computerized speech can open new worlds for handicapped people. Special programs enable blind users to direct a speech synthesizer to read aloud the words on the computer screen. This makes computerized information bases, word processing, programming languages, and many other computer tools available to the blind. Computerized speech can also help provide communication aids for people with speech impairments.

Computerized speech recognition devices are also becoming less expensive and more readily available. These devices enable computers to recognize the words people say, and they can make programs easier to use and more appealing. More importantly, speech recognition devices make computers accessible to many people whose physical handicaps prevent them from using keyboards.

Dr. Glenn M. Kleiman is an educational psychologist and software developer. He is the author of Brave New Schools: How Computers Can Change Education (Reston/Prentice-Hall) and the designer of Square Pairs, an educational game program (Scholastic, Inc.).

Two Types Of Computer Speech

There are two general types of computer-generated speech: stored vocabulary and unlimited vocabulary.

Stored vocabulary speech is created by a person saying the words. Special devices and programs measure characteristics of the sound waveform (for example, intensity, pitch) as the person pronounces each word. Numbers representing the waveform at each fraction of a second are stored in the computer. That is, the speech waveform (an example of what is called analogue information) is converted to a sequence of numbers (digitized information). The numbers are then used to recreate the sound of the word whenever it is needed.
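For readers who like to see an idea in program form, the conversion from waveform to numbers can be sketched in a few lines of Python. This is only an illustration, not the method used by any particular speech device: the sampling rate, the 256 levels, and the names digitize, reconstruct, and tone are all assumptions made for the example, and a pure tone stands in for a spoken word.

```python
import math

SAMPLE_RATE = 8000   # measurements per second (assumed for illustration)
LEVELS = 256         # each measurement stored as a number 0..255 (assumed)

def digitize(signal, duration, sample_rate=SAMPLE_RATE, levels=LEVELS):
    """Measure a continuous signal (a function of time giving -1.0..1.0)
    at regular intervals and store each measurement as a whole number."""
    count = round(duration * sample_rate)
    samples = []
    for n in range(count):
        t = n / sample_rate
        value = max(-1.0, min(1.0, signal(t)))          # keep within range
        samples.append(round((value + 1.0) / 2.0 * (levels - 1)))
    return samples

def reconstruct(samples, levels=LEVELS):
    """Turn the stored numbers back into approximate signal values."""
    return [s / (levels - 1) * 2.0 - 1.0 for s in samples]

def tone(t):
    """A 440 Hz tone standing in for a spoken word."""
    return math.sin(2 * math.pi * 440 * t)

numbers = digitize(tone, duration=0.01)   # one hundredth of a second
rebuilt = reconstruct(numbers)
```

The rebuilt values are close to the original waveform but not identical, because each measurement was rounded to one of a limited number of levels.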

Stored vocabulary speech can sound very human when individual words are produced. However, it usually sounds choppy and somewhat artificial when the words are combined into sentences. With this technique, the computer is limited to the words previously stored in its memory.

Each digitized word requires a large amount of memory—many numbers must be stored for the computer to recreate the spoken words clearly—so the vocabulary of a personal computer with digitized speech is limited. However, the possibilities for digitized speech will expand as larger-capacity computer memories become less expensive, and as more efficient techniques are developed for representing speech waveforms within the computer's memory.
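A rough calculation shows why memory runs out so quickly. Every figure in the sketch below is an assumption chosen for illustration (the sampling rate, one byte per measurement, a half-second word, and 32K of free memory); none is taken from a particular product, and real devices may store speech far more compactly.

```python
SAMPLE_RATE = 8000             # measurements per second (assumed)
BYTES_PER_SAMPLE = 1           # one byte per measurement (assumed)
WORD_SECONDS = 0.5             # length of a typical spoken word (assumed)
FREE_MEMORY_BYTES = 32 * 1024  # memory left over for speech (assumed)

def bytes_per_word(rate=SAMPLE_RATE, size=BYTES_PER_SAMPLE, seconds=WORD_SECONDS):
    """Memory needed to store one uncompressed digitized word."""
    return round(rate * size * seconds)

def words_that_fit(memory=FREE_MEMORY_BYTES):
    """How many such words fit in the available memory."""
    return memory // bytes_per_word()

print(bytes_per_word())   # 4000 bytes for a single word
print(words_that_fit())   # 8 words in 32K
```

Under these assumptions, a single word takes about 4,000 bytes and only eight words fit in memory at once, which is why more efficient ways of representing the waveform matter so much.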

Unlimited Vocabulary

With unlimited vocabulary speech, programs for generating the individual speech sounds (phonemes) are stored in the computer, along with the rules for combining them into words, phrases, and sentences. This technique of speech synthesis enables the computer to produce any word from its component sounds. Synthesized speech does not sound as natural as digitized speech, but it has been greatly improved in recent years.
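The idea of building words from a small inventory of sounds can be sketched as follows. The phoneme symbols and the short lists of numbers are placeholders invented for this example; a real synthesizer generates each sound with its own program and smooths the joins between sounds, which this sketch does not attempt.

```python
# Placeholder "waveforms" for a few phonemes (hypothetical values).
PHONEME_SOUNDS = {
    "k":  [3, 9, 3],
    "ae": [5, 7, 7, 5],
    "t":  [8, 1],
    "b":  [2, 6, 2],
}

def synthesize(phonemes):
    """Build a word's waveform by joining its phoneme sounds end to end."""
    waveform = []
    for symbol in phonemes:
        if symbol not in PHONEME_SOUNDS:
            raise KeyError(f"no sound stored for phoneme {symbol!r}")
        waveform.extend(PHONEME_SOUNDS[symbol])
    return waveform

cat = synthesize(["k", "ae", "t"])
bat = synthesize(["b", "ae", "t"])
tab = synthesize(["t", "ae", "b"])   # same sounds as "bat", new order
```

Four stored sounds are enough to produce three different words here, and any other word made of the same sounds could be added without storing anything new.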

Phoneme synthesis techniques have been combined with text-to-speech conversion programs. These programs contain a set of rules which tell the computer how to change any sequence of letters into speech. Creating a program of this sort for English is difficult, since many letters and letter patterns are pronounced in various ways, depending on the context of their use. For example, the word "read" is pronounced differently depending upon whether it refers to the past or the future ("John read the book" versus "John will read the book"). The same aspects of English which cause difficulties for people in learning to read also cause difficulties in programming computers to convert written English to spoken English.
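A toy version of such letter-to-sound rules might look like the sketch below. The handful of rules and the sound symbols are invented for illustration and fall far short of what a working text-to-speech program needs; the point is only to show one rule that depends on context (the letter c) and one word the rules cannot get right every time.

```python
# Letter pairs that stand for a single sound (a tiny, hypothetical rule set).
DIGRAPHS = {"ph": "f", "sh": "sh", "ch": "ch", "ee": "EE", "ea": "EE"}
SOFTENERS = "eiy"   # letters that make a preceding "c" sound like "s"

def letters_to_sounds(word):
    """Convert a written word to a list of sound symbols, left to right."""
    word = word.lower()
    sounds = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:               # two-letter patterns come first
            sounds.append(DIGRAPHS[pair])
            i += 2
            continue
        letter = word[i]
        if letter == "c":                  # context rule: look at the next letter
            following = word[i + 1:i + 2]
            soft = following != "" and following in SOFTENERS
            sounds.append("s" if soft else "k")
        else:
            sounds.append(letter)          # default: the letter stands for itself
        i += 1
    return sounds

print(letters_to_sounds("cat"))    # ['k', 'a', 't']
print(letters_to_sounds("cent"))   # ['s', 'e', 'n', 't']
print(letters_to_sounds("read"))   # ['r', 'EE', 'd'] whatever the tense
```

Because the rules look only at letters, "read" always comes out the same way. Choosing the right pronunciation would require knowing something about the rest of the sentence.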

While text-to-speech programs do not produce human-sounding speech, most people understand it easily after a short time—much the way we can understand someone who has a foreign accent and mispronounces some words. Text-to-speech is valuable for people with impaired vision. However, it is not suitable for educational applications in which clear speech is essential.

A Talking Apple

The Echo II speech synthesizer, for Apple II computers, makes use of both stored and unlimited vocabulary techniques. The Echo II is a board that plugs into a slot in the Apple. A speaker or headphone then plugs into the board. The board has volume and pitch controls, but these can also be controlled from software. The basis of the Echo II is a speech synthesis chip made by Texas Instruments. This chip, an advanced version of the one used in the original Speak and Spell toy, is used in most of the available speech synthesizers.

The Echo II comes with a text-to-speech program. It also allows you to enter speech more directly by using symbols to represent each sound (for example, there are different symbols for the long e of Pete and the short e of bet). In addition, a disk containing 700 digitized words is available. These provide a good demonstration of the superior quality of digitized speech.

With the Echo II, it is easy to add speech to your own program. You can change the volume, pitch, and rate of speech, all under the control of your program. Produced by Street Electronics, the Echo II sells for about $150. Street Electronics also produces speech synthesizers for the IBM PC and for other personal computers. Other speech synthesizers are available, including Type-'N-Talk from Votrax, Mockingboard from Sweet Micro Systems, and S.A.M. from Don't Ask Computer Software.

Computers That Listen

A great deal of research has been devoted to getting computers to recognize people's speech. This research has shown that speech is very complex and that we do not fully understand how people are able to recognize spoken words so easily. It is much more difficult to make computers recognize spoken words than it is to make them pronounce words. However, advances have been made and some usable, although limited, devices are now available.

Current systems for personal computers require the user to train the computer to distinguish among a number of spoken words. The technique is related to stored vocabulary speech. The user selects a vocabulary and says each word; the computer then digitizes the sound patterns and stores a set of numbers representing the waveform of the word.

Once trained, the computer recognizes a spoken word by digitizing it and comparing the resulting pattern of numbers to the patterns stored in its memory. Since the pronunciation changes slightly each time an individual says a word, exact matches are not expected, but the computer is programmed to find the closest match. Since people differ widely in their speech patterns, these systems are reliable only in recognizing the words spoken by the person who spoke the original training set.
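The closest-match idea can be sketched in a few lines. The stored patterns below are made-up numbers, and the measure of closeness (the sum of squared differences) is simply one common choice assumed for the example. The sketch also assumes every pattern has the same length, which real speech does not; actual devices must cope with words spoken faster or slower than the training version.

```python
# Patterns stored during training (hypothetical numbers for illustration).
TEMPLATES = {
    "yes":  [2, 9, 7, 3],
    "no":   [8, 8, 2, 1],
    "stop": [1, 3, 9, 9],
}

def distance(a, b):
    """Sum of squared differences between two equal-length patterns."""
    if len(a) != len(b):
        raise ValueError("patterns must be the same length in this sketch")
    return sum((x - y) ** 2 for x, y in zip(a, b))

def recognize(pattern, templates=TEMPLATES):
    """Return the stored word whose pattern is closest to the one heard."""
    return min(templates, key=lambda word: distance(pattern, templates[word]))

heard = [3, 8, 7, 2]        # close to "yes", but not an exact match
print(recognize(heard))     # yes
```

Note that recognize always returns some word, even for a sound that resembles none of the stored patterns. A practical system would also need a cutoff beyond which it reports no match.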

The digitized representation of each word uses up a lot of computer memory, and the matching process becomes progressively slower and less reliable as more words are added. Therefore, speech recognition systems work well only with limited vocabularies.

It Takes Dictation

One speech recognition device is the Voice Entry Terminal (VET-2), produced by Scott Instruments for Apple II computers. The VET-2 can be programmed for sets of up to 40 words. The Apple II can hold only one set in memory at a time, but others can be loaded from disk as needed.

One important characteristic of the VET-2 is that it functions as a keyboard emulator. It plugs into the computer in parallel with the keyboard, so both can be used together. Each spoken word is associated with a string of printed characters.

When the spoken word is recognized, the VET-2 sends the same signals to the computer that the keyboard sends when the associated keys are pressed. Therefore, you can have the VET-2 recognize a spoken name for each key and then "type" by saying the names of letters, numbers, and special characters. You can then use standard software with voice input replacing the keyboard.
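The keyboard emulation idea amounts to a lookup table from recognized words to strings of characters. The sketch below is a guess at the general principle only, not a description of how the VET-2 works internally; the spoken names, the character strings, and the function names are all hypothetical. The 40-word limit is the figure given for the VET-2 above.

```python
MAX_WORDS = 40   # size of one vocabulary set, as described for the VET-2

# Each spoken word is tied to the characters it should "type" (hypothetical).
KEY_STRINGS = {
    "ay": "A",
    "bee": "B",
    "one": "1",
    "return": "\r",
    "catalog": "CATALOG\r",   # one spoken word can stand for several keys
}

def check_vocabulary(key_strings, limit=MAX_WORDS):
    """Refuse a vocabulary set that is larger than the device can hold."""
    if len(key_strings) > limit:
        raise ValueError(f"vocabulary has more than {limit} words")
    return key_strings

def emulate_keyboard(recognized_words, key_strings=KEY_STRINGS):
    """Return the characters the computer would receive from the keyboard."""
    check_vocabulary(key_strings)
    typed = []
    for word in recognized_words:
        text = key_strings.get(word)
        if text is not None:      # unrecognized words send nothing
            typed.append(text)
    return "".join(typed)

print(repr(emulate_keyboard(["ay", "bee", "one", "return"])))   # 'AB1\r'
```

Since the program receiving the characters cannot tell whether they came from the keys or from the voice device, standard software needs no changes.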

What About Language?

Current technology for personal computers enables us to have computers speak and recognize individual words. But what about sentences and paragraphs? For speech production, we can have the computer string words together, but replicating the intonation and stress patterns of human voices is another, much more difficult, matter.

For speech recognition, anything more complex than the simplest sentence creates inordinate difficulties. Try listening to fluent speakers of a language you do not understand. Can you even tell where one word ends and the next begins? Recognizing the words in spoken sentences generally depends upon being able to understand meanings, something we have not yet learned to program personal computers to do.

Getting computers to produce and understand language is the focus of much of the effort of researchers in artificial intelligence. They have had only limited success, even with very powerful computers. For the present, we will have to be content with personal computers which are at the single-word stage of language development.

Street Electronics (Echo II)

1140 Mark Ave.

Carpinteria, CA 93013

Sweet Micro Systems (Mockingboard)

Cranston, RI 02910

Votrax (Type-'N-Talk)

500 Stephenson Highway

Troy, MI 48084

Don't Ask Computer Software (Software Automatic Mouth)

2265 Westwood Blvd

Los Angeles, CA 90064

Scott Instruments (Voice Entry Terminal)

1111 Willow Springs Drive

Denton, TX 76201