
Glove-TalkII: An Adaptive Gesture-to-Formant Interface

Sidney Fels & Geoffrey Hinton

Department of Computer Science
University of Toronto
Toronto, ON, Canada, M5S 1A4
ssfels@ai.toronto.edu, hinton@ai.toronto.edu

© ACM

Abstract:

Glove-TalkII is a system that translates hand gestures to speech through an adaptive interface. Hand gestures are mapped continuously to 10 control parameters of a parallel formant speech synthesizer. The mapping allows the hand to act as an artificial vocal tract that produces speech in real time. This gives an unlimited vocabulary and multiple languages, in addition to direct control of fundamental frequency and volume. Currently, the best version of Glove-TalkII uses several input devices (including a Cyberglove, a ContactGlove, a polhemus sensor, and a foot pedal), a parallel formant speech synthesizer and 3 neural networks. The gesture-to-speech task is divided into vowel and consonant production by using a gating network to weight the outputs of a vowel and a consonant neural network. The gating network and the consonant network are trained with examples from the user. The vowel network implements a fixed, user-defined relationship between hand position and vowel sound and does not require any training examples from the user. Volume, fundamental frequency and stop consonants are produced with a fixed mapping from the input devices. One subject has trained for about 100 hours to speak intelligibly with Glove-TalkII. He passed through eight distinct stages while learning to speak. He speaks slowly, with speech quality similar to a text-to-speech synthesizer but with far more natural-sounding pitch variations.

Keywords:

Gesture-to-speech device, gestural input, speech output, speech acquisition, adaptive interface, talking machine.

Introduction

Many different possible schemes exist for converting hand gestures to speech. The choice of scheme depends on the granularity of the speech that you want to produce. Figure 1 identifies a spectrum defined by possible divisions of speech based on the duration of the sound for each granularity. What is interesting is that, in general, the coarser the division of speech, the smaller the bandwidth necessary for the user. In contrast, where the granularity of speech is on the order of articulatory muscle movements (i.e., the artificial vocal tract [AVT]), high-bandwidth control is necessary for good speech. Devices which implement this model of speech production are like musical instruments which produce speech sounds. The user must control the timing of sounds to produce speech much as a musician plays notes to produce music. The AVT allows unlimited vocabulary, control of pitch and non-verbal sounds. Glove-TalkII is an adaptive interface that implements an AVT.

FIGURE 1: Spectrum of gesture-to-speech mappings based on the granularity of speech.

Translating gestures to speech using an AVT model has a long history beginning in the late 1700's. Systems developed include a bellows-driven hand-varied resonator tube with auxiliary controls (1790's [16]), a rubber-moulded skull with actuators for manipulating tongue and jaw position (1880's [1]) and a keyboard-footpedal interface controlling a set of linearly spaced bandpass frequency generators called the Voder (1940 [4]). The Voder was demonstrated at the World's Fair in 1939 by operators who had trained continuously for one year to learn to speak with the system. This suggests that the task of speaking with a gestural interface is very difficult, and that the training times could be significantly decreased with a better interface. Glove-TalkII is implemented with neural networks, which allow the system to learn the user's interpretation of an articulatory model of speaking.

The obvious use of an AVT is as a speaking aid for speech-impaired people. Clearly, the difficulties encountered with this application include the extreme motor demands and the time required to learn to use the device compared to other speech prostheses which require less control. Additionally, users must be able to hear to use the device effectively, which further limits the potential user group. Of course, care must be taken when considering these criticisms, since AVTs potentially provide a much richer speech space than other coarse-granularity systems, which may be preferable for some people. And, just as children who are learning to speak are willing to spend the large amount of time required to control their vocal tracts, it is not unreasonable to expect users to spend on the order of 100 hours to learn to speak with an AVT like Glove-TalkII. Besides the obvious application of Glove-TalkII, the neural network techniques used successfully here can be applied to other complex interfaces where adaptation between a user's cognitive space and some objective space is required, for example, musical instrument design and telerobotics.

This paper first describes the Glove-TalkII system and then the experience of a single subject as he learned to speak with Glove-TalkII over 100 hours. Quantitative analysis of Glove-TalkII only provides a rough guide to the performance of the whole system. Observation of the single subject allows for qualitative analysis of Glove-TalkII to determine its effectiveness as a gesture-to-speech device.

OVERVIEW OF GLOVE-TALKII

The Glove-TalkII system converts hand gestures to speech, based on a gesture-to-formant model. The gesture vocabulary is based on a vocal-articulator model of the hand. By dividing the mapping tasks into independent subtasks, a substantial reduction in network size and training time is possible (see [5]).

Figure 2 illustrates the whole Glove-TalkII system. Important features include the three neural networks labeled vowel/consonant decision (V/C), vowel, and consonant. The V/C network is a 12-10-1 feed forward neural network with sigmoid activation functions (Footnote 1).

FIGURE 2: Block diagram of Glove-TalkII: input from the user is measured by the Cyberglove, polhemus, keyboard and foot pedal, then mapped using neural networks and fixed functions to formant parameters which drive the parallel formant synthesizer [14].

The V/C network is trained on data collected from the user to decide whether he wants to produce a vowel or a consonant sound. Likewise, the consonant network is trained to produce consonant sounds based on user-generated examples from an initial gesture vocabulary. The consonant network is a 12-15-9 feed forward network. It uses normalized radial basis function (RBF) [2] activations for the hidden units and sigmoid activations for the output units. In contrast, the vowel network implements a fixed mapping between hand-positions and vowel phonemes defined by the user. The vowel network is a 2-11-8 feed forward network. It also uses normalized RBF hidden units and sigmoid output units (Footnote 2).
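
The paper gives only the layer sizes and unit types for these networks. The following is a minimal sketch (in Python/NumPy rather than the original Xerion code) of a feed-forward network with normalized RBF hidden units and sigmoid output units; the centre and width initializations shown are illustrative assumptions, and in Glove-TalkII the weights would of course be set by training.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class NormalizedRBFNet:
        """Feed-forward net with normalized radial-basis-function hidden units
        and sigmoid output units. The 2-11-8 shape used below matches the vowel
        network; the consonant network would be 12-15-9."""

        def __init__(self, n_in, n_hidden, n_out, seed=0):
            rng = np.random.default_rng(seed)
            self.centres = rng.uniform(0.0, 1.0, size=(n_hidden, n_in))  # RBF centres
            self.widths = np.full(n_hidden, 0.1)                         # RBF widths
            self.w_out = rng.normal(0.0, 0.1, size=(n_out, n_hidden))    # hidden-to-output weights
            self.b_out = np.zeros(n_out)

        def forward(self, x):
            # Gaussian RBF activations, normalized so the hidden layer sums to one.
            d2 = np.sum((self.centres - x) ** 2, axis=1)
            phi = np.exp(-d2 / (2.0 * self.widths ** 2))
            phi = phi / (phi.sum() + 1e-12)
            # Sigmoid output units give synthesizer parameters scaled to [0, 1].
            return sigmoid(self.w_out @ phi + self.b_out)

    # Example: an untrained 2-11-8 net mapping a scaled (x, y) hand position.
    vowel_net = NormalizedRBFNet(n_in=2, n_hidden=11, n_out=8)
    print(vowel_net.forward(np.array([0.4, 0.7])))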

Eight contact switches on the user's left hand designate the stop consonants (B, D, G, J, P, T, K, CH), because the dynamics of such sounds proved too fast to be controlled by the user. The foot pedal provides a volume control by adjusting the speech amplitude, and this mapping is fixed. The fundamental frequency, which is related to the pitch of the speech, is determined by a fixed mapping from the user's hand height. The output of the system drives 10 control parameters of a parallel formant speech synthesizer every 10 msec. The 10 control parameters are: nasal amplitude (ALF), first, second and third formant frequency and amplitude (F1, A1, F2, A2, F3, A3), high frequency amplitude (AHF), degree of voicing (V) and fundamental frequency (F0).
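
The paper does not spell out how the gating network combines the two expert networks beyond figure 2; a natural reading is a convex combination computed once per 10 msec frame. A hedged sketch, assuming both expert networks emit the same formant-parameter vector and that the fixed F0 and volume controls are appended separately:

    def blend_frame(gate, vowel_params, consonant_params):
        """One 10-msec control frame. `gate` is the V/C network output in (0, 1):
        near 1 for an open (vowel) hand shape, near 0 for a consonant shape."""
        return gate * vowel_params + (1.0 - gate) * consonant_params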

Once trained, Glove-TalkII can be used as follows: to initiate speech, the user forms the hand shape of the first sound she intends to produce. She depresses the foot pedal and the sound comes out of the synthesizer. Vowels and consonants of various qualities are produced in a continuous fashion through the appropriate co-ordination of hand and foot motions. Words are formed by making the correct motions; for example, to say "hello" the user forms the "h" sound, depresses the foot pedal and quickly moves her hand to produce the "e" sound, then the "l" sound and finally the "o" sound. The user has complete control of the timing and quality of the individual sounds.

The articulatory mapping between gestures and speech is decided a priori. The mapping is based on a simplistic articulatory phonetic description of speech [10]. The X,Y coordinates (measured by the polhemus) are mapped to something like tongue position and height (in reality, the X,Y coordinates map more closely to changes in the first two formants, F1 and F2, of vowels; from the user's perspective, though, the link to tongue movement is useful), producing vowels when the user's hand is in an open configuration (see figure 3 for the correspondence and table 1 for a typical vowel configuration). Manner and place of articulation for non-stop consonants are determined by opposition of the thumb with the index and middle fingers. Table 1 shows the initial gesture mapping between static hand gestures and static articulatory positions corresponding to phonemes. The ring finger controls voicing. Only static articulatory configurations are used as training points for the neural networks; the interpolation between them is a result of the learning and is not explicitly trained. For example, the vowel space interpolation allows the user to move easily within vowel space to produce diphthongs. Ideally, the transitions should also be learned, but in the text-to-speech formant data we use for training [11] these transitions are poor, and it is very hard to extract formant trajectories from real speech accurately.

FIGURE 3: Hand-position to vowel sound mapping. The coordinates are specified relative to the origin at the sound A. The X and Y coordinates form a horizontal plane parallel to the floor when the user is sitting. The 11 cardinal phoneme targets are determined with the text-to-speech synthesizer.



TABLE 1: Examples of static gesture-to-consonant mapping. Note that each gesture corresponds to a static non-stop consonant phoneme generated by the text-to-speech synthesizer; the neural networks provide the continuous interpolation.

Hardware and Software Tools.

There are five main pieces of hardware used in Glove-TalkII, four for input and one for output. The glove input device is a Cyberglove. This device has 18 flex sensors embedded inside a lightweight glove. The flex sensors respond linearly to bend angle and are placed strategically in the glove. These angles are measured at a frequency of about 100 Hz.

The second input device is a polhemus sensor which measures the X, Y, Z, roll, pitch and yaw of the hand relative to a fixed source. The small sensor is mounted on the back of the Cyberglove on the user's forearm; thus, the six parameters are independent of the user's wrist motion. The device measures the parameters at a frequency of 60 Hz.

The third input device is a ContactGlove. This device measures contacts between points on the fingers and the thumb, which are mapped to stop consonants.

The final input device used is a foot pedal. This device has a variable resistance which is an approximately linear function of foot depression. The variable resistance is used in a voltage divider circuit. The variable voltage is sampled by the A/D circuitry included with the computer at its lowest frequency of 8 kHz. Additionally, several elastic bands have been attached to the base to provide some force feedback and also to return the foot pedal to the fully undepressed position when the user's foot is lifted.
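
As an illustration of this fixed volume mapping, here is a small sketch; the A/D integer range used is an assumption, not a value reported in the paper.

    def pedal_to_volume(adc_sample, adc_min=0, adc_max=255):
        """Map a raw A/D reading of the pedal's voltage-divider output to a
        normalized volume in [0, 1]; 0 corresponds to the pedal fully up (silence)."""
        v = (adc_sample - adc_min) / float(adc_max - adc_min)
        return min(max(v, 0.0), 1.0)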

The output device is a Loughborough Sound Images (LSI) parallel formant speech synthesizer. The device requires 16 speech parameters at 100 Hz to operate. The parameters are quantized to 6 bits (integer range [0,63]). The ten main parameters are these:

  1. ALF - low frequency amplitude; logarithmic scale
  2. F1 - first formant frequency; 115 to 1060 Hz in increments of 15 Hz
  3. A1 - first formant amplitude; logarithmic scale
  4. F2 - second formant frequency; 730 to 2620 Hz in increments of 30 Hz
  5. A2 - second formant amplitude; logarithmic scale
  6. F3 - third formant frequency; 1510 to 3400 Hz in increments of 30 Hz
  7. A3 - third formant amplitude; logarithmic scale
  8. AHF - high frequency amplitude; logarithmic scale
  9. V - degree of voicing
  10. F0 - fundamental frequency; 25 to 417 Hz using logarithmic scale

The first eight are called formant parameters and can be thought of as resonances of the vocal tract. The last two represent glottal controls. The parameters are sent to the synthesizer over a parallel port. Control of these parameters is sufficient to produce high-quality speech. A text-to-speech synthesizer [11] is available which outputs formant parameters to drive the formant synthesizer and provides formant targets for training the neural networks.
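
The ranges and step sizes listed above imply a simple 6-bit encoding: for example, (1060 - 115) / 15 = 63, so F1 maps exactly onto the integer range [0, 63]. A sketch of such an encoding (the rounding and clipping policy is an assumption):

    def quantize_formant(freq_hz, base_hz, step_hz):
        """Encode a formant frequency as the synthesizer's 6-bit integer code."""
        code = int(round((freq_hz - base_hz) / step_hz))
        return max(0, min(63, code))

    # Example: a 500 Hz first formant (base 115 Hz, 15 Hz steps) encodes as 26.
    f1_code = quantize_formant(500, base_hz=115, step_hz=15)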

All the software runs on a Silicon Graphics Personal Iris 4D/35. The Xerion neural network libraries are used to simulate the neural networks and to run all the hardware devices. After all the preprocessing and data collection, there is enough computing power remaining in each 10 msec interval to simulate networks with up to 1000 floating point weights, which is sufficient for Glove-TalkII to operate without significant interruption. Glove-TalkII requires about 200,000 floating point operations per second.
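
One way to arrive at this figure: roughly 1000 weights, each contributing about one multiply and one add in every 10 msec frame, gives 1000 × 2 × 100 frames per second ≈ 200,000 floating point operations per second.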

LEARNING TO SPEAK WITH GLOVE-TALKII

One subject has been trained extensively to speak with Glove-TalkII. The subject is an accomplished pianist who can speak. It was anticipated that his skill in forming finger patterns for playing the piano and his musical training would transfer positively to aid his learning to speak with Glove-TalkII. The subject went through 8 learning phases during speech acquisition. The phases are:

  1. initial set-up
  2. initial network training
  3. individual phoneme formation within simple words and CV/VC pairings
  4. word formation and interword pitch control
  5. short segment formation with suprasegmental pitch control and singing
  6. passage reading
  7. fine tuning; movement control and phrasing
  8. spontaneous speech

During his training, Glove-TalkII also adapted to incorporate changes required by the subject. Of course, his progression through the stages is not as linear as suggested by the above list. Some aspects of speaking were more difficult than others, so a substantial amount of mixing of the different levels occurred. Practice at the higher levels facilitated perfecting more difficult sounds that were still being practiced at the lower levels. Also, the stages are iterative, that is, at regular intervals the subject returns to lower levels to further refine his speech. An interesting research issue would be to determine how adaptation by the user interacts with adaptation by the interface.

Initial Set-up.

The first phase was initializing the system and familiarizing the subject with it. The subject's hand parameters were calibrated (Footnote 3) using the graphical hand and the displayed finger angles. A scale file was created from recorded hand data; this file is used to scale the input data between 0 and 1 for input to the neural networks. The subject was familiarized with putting on the glove and setting up Glove-TalkII.
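
The format of the scale file is not described; the sketch below shows the kind of per-sensor minimum/range scaling it implies. The function and variable names are hypothetical.

    import numpy as np

    def make_scale(recorded):
        """Per-sensor minimum and range computed from recorded calibration data
        (rows are time steps, columns are individual sensor readings)."""
        lo = recorded.min(axis=0)
        hi = recorded.max(axis=0)
        return lo, np.where(hi > lo, hi - lo, 1.0)

    def scale_input(sample, lo, span):
        """Scale one raw input vector into [0, 1] for the neural networks."""
        return np.clip((sample - lo) / span, 0.0, 1.0)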

Initial Network Training.

The second phase was initial network training. The subject familiarized himself with the initial mapping. A complete set of training data was collected from the subject. The typical data collection scheme for a single phoneme is as follows (a minimal sketch of the stored record appears after the list):

  1. A target consonant plays for 100 msec through the speech synthesizer.
  2. The user forms a hand configuration corresponding to the phoneme.
  3. The user depresses the foot pedal to begin recording. The start of the recording is indicated by a green square appearing on the monitor.
  4. 10-15 time steps of hand data are collected and stored with the corresponding formant targets and phoneme identifier; the end of data collection is indicated by the green square turning red.
  5. The user chooses whether to save data to a file and whether to redo the current target or move to the next one.
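
Step 4 implies that each stored frame pairs the measured hand data with the formant targets and a phoneme label. A hypothetical sketch of such a record (the field names are illustrative, not taken from the paper):

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class TrainingFrame:
        """One of the 10-15 stored time steps collected for a single target phoneme."""
        hand_data: np.ndarray        # scaled glove and polhemus measurements
        formant_targets: np.ndarray  # synthesizer parameters for the target phoneme
        phoneme: str                 # phoneme identifier, e.g. "SH"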

The first training set consisted of 2830 examples of static consonants for training the consonant network and 3502 examples (2830 consonants and 672 vowels) to train the V/C network. These data were used to train Glove-TalkII's neural networks to map the subject's interpretation of the initial gesture vocabulary. During data collection, the subject memorized the static hand configuration to static consonant mapping. In addition, he provided hand configurations most suited to his hand that approximated the initial mapping. The simplicity of the data collection procedure and the ease with which the networks train are important for Glove-TalkII to be a useful adaptive interface.

Simple Words.

The third phase involved the subject learning to say individual sounds within consonant-vowel pairs (CV), vowel-consonant pairs (VC), and simple word contexts reliably. This was the first time the subject had spoken with the system. The speech produced was unintelligible at this time; however, individual sounds were recognizable. Most importantly, the subject began to recognize different types of phoneme sounds within the sounds he produced. Phoneme recognition helps provide audio feedback for the subject to adapt his motions to produce desired effects. Much of the subject's practice time was spent repeating each of the consonant and vowel sounds, for example "we", "el", "I". After several hours of practice the subject determined which sounds were most difficult to produce. For these difficult sounds, more training data were collected and the networks retrained with the new data added to the original data. This process was repeated several times in an attempt to improve the consonant mapping. As the subject became more proficient at producing static hand configurations (about 5 more hours of practice), it became clear that some of the difficulty in producing individual sounds was that often a mixture of vowels and consonants was being produced. This effect was caused by the V/C network having an output slightly larger than zero for consonant sounds, causing the output of the vowel network to mix with the consonant sound. Vowel data and consonant data were collected to retrain the V/C network (the consonant network was also retrained with the new data). The subject found this version of the V/C network enabled him to speak individual sounds very reliably. The only phoneme not completely satisfactory was ZH (as in "pleasure"), which rarely occurs in English.

Words and Pitch.

The subject proceeded to stage 4: word formation and interword pitch control. The subject spent time practicing common English words (from [3]). This exercise provided practice forming transitions the subject would be likely to encounter during conversational speech. While practicing these words some of the difficult transitions became evident, for example in the word "only". For difficult transitions such as this one, an inverse mapping was used to assist the subject in finding the correct hand gesture timing to make the word intelligible [7]. One of the phonemes the subject found difficult to say was the R sound in words like "are". Using the inverse mapping and a pseudo-spectrum of the R sound, it was observed that this sound is actually a vowel sound with dropping pitch. Thus, the subject experimented with the effects of pitch control on individual words to make them intelligible.

Stages 3 and 4 are less distinct than suggested above. The subject practiced individual words and individual sounds simultaneously. This amalgamation became particularly prominent as the subject became more proficient with individual sounds. Data for improving the consonant sounds were collected over the many hours of practice in phases 3 and 4.

Glove-TalkII was retrained about 10 times during these initial phases, sometimes with more data for particular phonemes and other times with replacement data. For future subjects, good performance of the V/C network must be a key focus in the early stages of learning. Several retraining sessions were probably unnecessary, since the phoneme errors were due to mixtures of vowels and consonants resulting from poor vowel/consonant distinctions.

Three more significant adjustments were made after the V/C network was performing properly. First, the I position on the vowel mapping was shifted to (5,0) from (4.5,1), which is midway between EE and E (see figure 3). This modification was necessary because the subject had difficulty saying the I phoneme, as in "is", reliably. This phoneme occurs frequently in English, causing significant intelligibility problems. The problem was probably due to the I and E sounds being placed relatively close to each other in the initial mapping; correspondingly, after the vowel network was trained, the area in the X-Y plane which produces the I sound was relatively too small. Second, the subject created another complete training set for every static phoneme sound once the V/C network performed well. The consonant network was trained with this new data set plus the data set used to train the good V/C network. Third, the entire vowel space was compressed by a factor of 0.75, since the subject found that he had to move his hand extensively in the X-Y plane to speak. A factor of 0.5 was also tried but was found extreme. Another interesting attempt to provide a better vowel space was to form a radial representation of the static phonemes. Using A as the centre, the remaining 10 vowel phonemes were placed at equidistant positions along a ring 5 cm away from the A. Training data were generated by partitioning the plane into sectors formed by the mid-points between phonemes on the ring, and by specifying phoneme targets for each of the sectors, sampled evenly with 60 points out to a radius of 10 cm. The subject found that, after only 15 hours of training on the original vowel space, the new vowel space was too different to integrate into his speech quickly. In comparison, shifting the I phoneme was easily integrated. From this observation, it appears that users can adapt relatively quickly to the first mapping, after which it becomes difficult to alter the mapping radically without significant performance penalties.
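
For concreteness, the following sketch generates a radial training set of this kind. The paper does not say whether the 60 points per sector were gridded or drawn at random, nor how the central A region was handled, and the phoneme ordering shown is illustrative; these details are assumptions.

    import numpy as np

    def radial_vowel_examples(ring_phonemes, max_radius=10.0, points_per_sector=60, seed=0):
        """Place the 10 non-A vowels at equal angles around A (the origin), split the
        plane into sectors at the mid-angles between neighbours, and sample each
        sector with (x, y) points out to max_radius centimetres."""
        rng = np.random.default_rng(seed)
        n = len(ring_phonemes)
        half_width = np.pi / n                     # half the angular width of one sector
        examples = []
        for i, phoneme in enumerate(ring_phonemes):
            centre_angle = 2.0 * np.pi * i / n
            angles = centre_angle + rng.uniform(-half_width, half_width, points_per_sector)
            radii = rng.uniform(0.0, max_radius, points_per_sector)
            for a, r in zip(angles, radii):
                examples.append(((r * np.cos(a), r * np.sin(a)), phoneme))
        return examples

    # Illustrative call; the actual phoneme names and ordering come from figure 3.
    data = radial_vowel_examples(["EE", "E", "I", "U", "OO", "UH", "O", "AW", "ER", "AR"])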

At this point, Glove-TalkII was relatively stable, allowing the subject to produce static phonemes in sequence reliably. The subject could intelligibly say simple words that had been practiced. He was proficient at manipulating pitch within a word as well as at producing difficult phoneme transitions, especially stop-to-vowel and stop-to-non-stop-consonant transitions.

Phrases.

The next phase of the subject's learning involved saying short segments. He would combine practiced words into meaningful utterances like "Hello, how are you?", the alphabet song ("a-b-c-d-e-f-g... now, I've said my abc, next time won't you sing with me?"), and excerpts from [15] (e.g. "Sam I am." and "I do not like green eggs and ham."). Much practice was required to get the word transitions correct so that they were intelligible. Further, pitch control was practiced over the whole segment to further improve intelligibility. His proficiency with pitch control was such that his version of the alphabet song was actually sung. By the end of this phase, the subject could say any simple utterance (1 to 2 syllables) intelligibly after only a few attempts. However, individual words were spoken slowly, at 2.5 to 3 times slower than the normal speaking rate (normal rate is defined by the text-to-speech synthesizer rate for unrestricted text). Pronunciation difficulties were mostly in the following areas:

First, it is very important for vowel phonemes to sound correct to achieve proper enunciation of slow speech. With Glove-TalkII, it is difficult to know exactly which vowel will be produced until the foot pedal is depressed, since there is poor absolute hand-position feedback. Second, timing stop consonants is difficult since the stop phonemes are produced within 100 msec; small timing errors produce unintelligible stop sounds. Third, if the R sound is sustained for too long, the speech produced sounds muffled and its intelligibility is impaired. Forty milliseconds should be a typical duration of the R sound, but this short timing is difficult to achieve since the static gesture required is hard to produce quickly (see table 1).

Notice that when making the R sound, the index finger is very bent. To extend the finger requires a fairly large motion which must be made quickly to achieve the necessary transition. One technique to achieve the necessary transition speed is to form the R sound partially instead of completing the finger trajectory. This technique requires a large degree of finger control since the subject's index finger does not oppose the thumb in this case. Another alternative for some R sounds is to use one of the R-sounding vowel sounds with a drop in pitch, as in "ar" in the British pronunciation of "farther". Examples of R's that can be made in this fashion include UR, AR and ER as in "curious", "are", and "curd" respectively. This type of R sound is much easier to produce quickly. The difficulty for the subject is learning to know automatically which way to make the R sound. The subject uses a combination of a pitch drop and a short R burst as a safe alternative for unknown R contexts.

Reading.

The subject next reached the sixth stage: learning to read lengthy passages. Some of the passages include [15], [12], and the "Little Miss Muffet" nursery rhyme. As the subject progressed through the stages, the length of speaking time without excess fatigue increased. In the first few stages, one hour of continuous practice was exhausting. By the reading stage he was comfortable enough that 1-2 hours of continuous practice was possible. Reading exercises helped improve intonation control. For example, the children's story "Green Eggs and Ham" [15] has two voices with different intonation, for example:

Would you like them in a house? Would you like them with a mouse? I would not like them in a house. I would not like them with a mouse.

In addition, reading caused improvement in the three most difficult areas for producing intelligible speech: reliably producing vowel sounds, stop consonant clusters and R technique.

Several distinguishing features of the subject's speech were observed in informal listening tests. First, a strong contextual effect occurred. In particular, when a listener hears the subject speak for the first time, she sometimes does not understand a single word; rather, she perceives a long, slurred, speech-like utterance. However, once the listener is told what the utterance was and hears the subject say it again, the words become intelligible and distinguishable. Subsequent novel speech also becomes intelligible. This effect is similar to the adaptation people make when listening to speakers with strong accents or speech impairments. For familiar utterances, the subject's speech is very intelligible; for example, counting and saying the alphabet were never misunderstood, even by listeners whose first language is not English. Second, the subject speaks slowly. Third, by using appropriate pitch control the subject produces some relatively natural-sounding speech compared to the text-to-speech synthesizer. As shown through interword pitch variation, proper control of pitch improves the intelligibility of the subject's speech. Fourth, even with considerable practice, some stops (i.e. P, T, K) are still difficult to discriminate in all contexts. While the R sound still sounds a bit muffled, after considerable practice (approximately 50 hours) the AR, ER, and UR sounds are made reliably in appropriate R-contexts, which alleviates the need to use the consonant hand configuration for R in these cases.

Fine Tuning.

The next phase the subject reached while learning to speak was the fine tuning stage. The subject designed exercises to help overcome problem areas and make his speech more easily understood and natural-sounding. The exercises fell into two categories: speed and phrasing. Speed exercises involved saying individual words and phrases as quickly as possible without compromising intelligibility. The phrases used increased in length from two words up to five words as he became more proficient at the exercise. During these exercises the subject tried to minimize his hand motion through vowel space. An interesting artifact of speaking faster is that vowel accuracy is not as vital for intelligibility. The speed exercises helped overcome the poor R quality as well as improve stop onset and offset timing. The phrasing exercises involved saying utterances with appropriate pauses controlled by the foot pedal. Pitch control is further refined by synchronizing the phrasing of the speech with the intonation. It is interesting to note that when phrases in a sentence are said quickly in chunks, separated by bringing the foot pedal to the full upright position (i.e. volume turned off), the speech quality improves greatly. Together, these exercises helped the subject speak better.

Conversation.

The final phase is spontaneous speech. Practicing while conversing with someone helped improve the whole spectrum of skills required for intelligible, natural speech. Conversations with unaccustomed listeners were particularly useful, since such listeners had not adapted to the peculiarities of Glove-TalkII speech and forced the subject to speak well.

Some of the stages of learning the subject progressed through are similar to the stages encountered while learning to play a musical instrument. The stages can also be categorized according to Fitts' three stages of learning [8]: cognitive, associative and autonomous. Using Fitts' levels, stages 1--4 correspond to the cognitive level, stages 5--7 the associative level and stage 8 the autonomous level. One of the key features discovered while the subject was at levels 3 and 4 was that the V/C network must work well for the user to get adequate feedback about which phonemes he produces.

After 100 hours of practice the subject progressed from simple, barely speech-like noise to intelligible, somewhat natural-sounding speech. The subject exhibits two levels of performance, one for rehearsed speech and one for unrehearsed. Rehearsed speech sounds similar to slow text-to-speech synthesized speech with natural intonation contours. For unrehearsed speech the subject still has difficulty pronouncing polysyllabic words intelligibly. However, with a few tries he can say any utterance found in the English language. Additionally, he can sing and make non-vocal sounds. The subject can also speak other languages. Even though Glove-TalkII has been designed for English speech sounds, it is a relatively simple matter to modify Glove-TalkII to produce speech sounds from other languages.

SUMMARY

The initial mapping for Glove-TalkII is loosely based on an articulatory model of speech. An open configuration of the hand corresponds to an unobstructed vocal tract, which in turn generates vowel sounds. Different vowel sounds are produced by movements of the hand in a horizontal X-Y plane; these movements correspond to movements of the first two formants, which are roughly related to tongue position. Consonants other than stops are produced by closing the index, middle, or ring fingers or flexing the thumb, representing constrictions in the vocal tract. Stop consonants are produced by pressing keys on the keyboard. F0 is controlled by hand height and speaking intensity by foot pedal depression.
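
As an illustration of the fixed pitch control, here is a small sketch mapping normalized hand height onto the synthesizer's 25-417 Hz F0 range; the normalization bounds and the choice of logarithmic interpolation (to match the synthesizer's logarithmic F0 scale) are assumptions.

    def hand_height_to_f0(height, h_min=0.0, h_max=1.0, f0_min=25.0, f0_max=417.0):
        """Fixed, non-adaptive pitch mapping: a higher hand gives a higher pitch,
        interpolated along a logarithmic scale between f0_min and f0_max."""
        t = min(max((height - h_min) / (h_max - h_min), 0.0), 1.0)
        return f0_min * (f0_max / f0_min) ** t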

Glove-TalkII learns the user's interpretation of this initial mapping. The V/C network and the consonant network learn the mapping from examples generated by the user during phases of training. The vowel network is trained on examples computed from the user-defined mapping between hand-position and vowels. The F0 and volume mappings are non-adaptive.

One subject was trained to use Glove-TalkII. After 100 hours of practice he is able to speak intelligibly. The subject passed through 8 distinct stages while he learned to speak. His speech is fairly slow (1.5 to 3 times slower than normal speech) and somewhat robotic. It sounds similar to speech produced with a text-to-speech synthesizer but has a more natural intonation contour, which greatly improves the intelligibility and naturalness of the speech. Reading novel passages intelligibly usually requires several attempts, especially with polysyllabic words. Intelligible spontaneous speech is possible but difficult.

ACKNOWLEDGEMENTS

We thank Peter Dayan, Sageev Oore and Mike Revow for their contributions. This research was funded by the Institute for Robotics and Intelligent Systems and NSERC. Geoffrey Hinton is the Noranda fellow of the Canadian Institute for Advanced Research.

References

  1. A. G. Bell, Making a Talking-Machine, Beinn Bhreagh Recorder, November 1909, 61-72.
  2. D. Broomhead and D. Lowe, Multivariable functional interpolation and adaptive networks, Complex Systems, 2, 1988, 321-355.
  3. G. Dewey, Relativ Frequency of English Speech Sounds, Harvard University Press, Cambridge, Mass., 1950.
  4. H. Dudley, R. R. Riesz and S. S. A. Watkins, A Synthetic Speaker, Journal of the Franklin Institute, 227 (6), 1939, 739-764.
  5. S. Fels and G. Hinton, Glove-Talk: A Neural Network Interface Between a Data-Glove and a Speech Synthesizer, IEEE Transactions on Neural Networks, 4, 1993, 2-8.
  6. S. Fels and G. Hinton, Glove-TalkII: Mapping Hand Gestures to Speech Using Neural Networks, in D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems 7, Morgan Kaufmann, San Mateo, 1995.
  7. S. S. Fels, Glove-TalkII: Mapping Hand Gestures to Speech Using Neural Networks, Dissertation, August 1994.
  8. P. M. Fitts, Perceptual-motor skill learning, in A. W. Melton (Ed.), Categories of Human Learning, Academic Press, New York, NY, 1964.
  9. D. E. Rumelhart, J. L. McClelland and the PDP research group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vols. I and II, MIT Press, Cambridge, MA, 1986.
  10. P. Ladefoged, A Course in Phonetics (2nd ed.), Harcourt Brace Jovanovich, New York, 1982.
  11. E. Lewis, A 'C' Implementation of the JSRU Text-to-Speech System, Computer Science Dept., University of Bristol, 1989.
  12. H. A. Rey, Curious George, Houghton Mifflin Company, Boston, 1941.
  13. D. E. Rumelhart, J. L. McClelland and the PDP research group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vols. I and II, MIT Press, Cambridge, MA, 1986.
  14. J. M. Rye and J. N. Holmes, A versatile software parallel-formant speech synthesizer, JSRU-RR-1016, Joint Speech Research Unit, Malvern, UK, 1982.
  15. Dr. Seuss, Green Eggs and Ham, Beginner Books, New York, 1960.
  16. W. Ritter von Kempelen, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine. Mit einer Einleitung von Herbert E. Brekle und Wolfgang Wild, F. Frommann, Stuttgart-Bad Cannstatt, 1970.

FOOTNOTES

Footnote 1: See [9] for an excellent introduction to neural networks and how they can be trained.

Footnote 2: Quantitative analysis of each of the various neural networks on typical training data can be found in [6]. As is typical with speech research, though, care must be taken when using quantitative analysis of the networks' performance to judge the performance of the whole system. For this reason, qualitative analysis of the single user is important.

Footnote 3: Calibration is performed infrequently due to the robustness of the Cyberglove sensors.