CHI '95 Proceedings
Doctoral Consortium

Integrating Multiple Cues for Spoken Language Understanding

Karen Ward

Oregon Graduate Institute of Science & Technology
20000 NW Walker Road
Beaverton, Oregon 97006 USA
+1 (503) 690-1121
wardk@cse.ogi.edu

© ACM

Abstract

As spoken language interfaces for real-world systems become a practical possibility, it has become apparent that such interfaces will need to draw on a variety of cues from diverse sources to achieve a robustness and naturalness approaching that of human performance [1]. However, our knowledge of how these cues relate to one another in the aggregate is still tantalizingly sketchy. We lack a strong theoretical basis for predicting which cues will prove useful in practice and for specifying how these cues should be combined to signal or cancel out potential interpretations of the communicative signal. In the research program summarized here, I propose to develop and test an initial theory of cue integration for spoken language interfaces.

Keywords:

Spoken language interfaces

Introduction

Recently we have seen an increase in research probing specific relationships between some of the knowledge sources used in computational spoken language understanding; a brief review may be found in [10]. Although several studies have shown relationships between pairs of various potential cues, none have attempted to study more complex interactions. Furthermore, these findings have not been applied to working systems. In this research program we are studying the contributions and interrelationships of four cues in the recognition of acknowledgments in spontaneous dialogue:

Current systems rely primarily on lexicalization to signal speaker intention, with the context of the preceding utterance providing additional constraints (e.g., [3], [8], [12]). Pause length is a strong marker for syntactic structure in professionally read speech ([7], [9]). We lack computational models for understanding pause cues in spontaneous speech, however; existing systems simply ignore pause. Pitch changes offer additional cues about the speaker's intentions. Pierrehumbert and Hirschberg [6] proposed that phrasal tunes signal relationships between the propositional content and the mutual beliefs of the participants. More specifically, Nakajima and Allen [4] examined the relationship between fundamental frequency (F0) and discourse structure in spontaneous task-oriented dialogue and found that F0 values tend to signal topic shift and topic continuation across pause boundaries. Pitch accents mark salient material [6], which may be useful not only in interpreting the intention behind the utterance but also in locating critical content words for recognition purposes.

Study

This research has two parts. In the first stage, now underway, my goal is to establish a basis for understanding cue interrelationships in task-oriented, mixed-initiative spontaneous conversation. Because recognition of the speech act motivating an utterance is central to formulating a helpful response, my Stage 1 work is based on a speech-act level analysis of a corpus of human-human conversations. I am now forming an initial theory relating speech act recognition to the four cues identified above; some early, partial results are reported in the next section. In the second stage of this study, I will test and refine my theory by implementing it in the context of a working system.

To limit the scope of this research to a manageable size, I focus my inquiry in two ways. First, I consider only a single type of speech act, the act of acknowledgment. Acknowledgments are used heavily in many types of task-oriented conversations to coordinate turn-taking and signal understanding [5], so systems must be able to recognize them reliably to achieve robust behavior in mixed-initiative interaction. Second, I limit my inquiry to a small number of cues. I do not assert that these cues are the only ones present in the speech signal, nor even that they are the only important ones. Nonetheless, understanding their interrelationships will enable me to establish the utility of my method and to form an initial theory that can be expanded and refined in later work. Furthermore, I expect cues such as these to be of practical use in spoken language interfaces because they are available and relatively robust in existing systems. In a system expected to participate in real-time conversational interaction, it will be important to exploit low-level cues that are robust and fast so that slower and more complex analysis can be reserved for those inputs that require it.

Stage 1: Understanding Acknowledgments

In the work completed to date, we examined prosodic characteristics of a word used in several distinct senses, one sense being acknowledgment. Our results indicate that intonation as reported by a pitch tracker can aid in disambiguating senses of homonyms such as different usages of the word "right" [11]. We examined the pitch patterns of 57 utterance-initial occurrences of the word "right" and found a significant difference in the pitch change (p = 0.0375). When "right" was used as an acknowledgment or answer (e.g., A: "Turn left again. heading north." B: "Right."), it was more likely to be pronounced with a falling intonation. When used as a direction (e.g., "Right on Main Street"), "right" was more likely to occur with a rising intonation.
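The falling-versus-rising distinction described above can be illustrated with a minimal sketch. The measure of pitch change used here (mean F0 of the second half of the word minus mean F0 of the first half) and the convention that a value of 0 marks an unvoiced frame are assumptions made for illustration; the function names are hypothetical and the text does not specify how pitch change was actually computed.

```python
from statistics import mean


def pitch_change(f0_hz):
    """Return an assumed pitch-change measure in Hz.

    Computed as mean F0 over the second half of the voiced frames minus
    mean F0 over the first half. Frames with value 0 are treated as
    unvoiced and skipped (an assumption about the pitch tracker output).
    """
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    mid = len(voiced) // 2
    return mean(voiced[mid:]) - mean(voiced[:mid])


def guess_usage(f0_hz):
    """Guess the sense of "right" from pitch direction alone.

    Falling pitch suggests acknowledgment, rising pitch suggests a
    direction. This is a tendency, not a reliable rule on its own.
    """
    return "acknowledgment" if pitch_change(f0_hz) < 0 else "direction"


if __name__ == "__main__":
    print(guess_usage([180, 170, 0, 150, 140]))
    print(guess_usage([140, 150, 170, 180]))
```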

We did not find pitch change alone to be an adequate discriminator of word usage; if used as the sole cue, it correctly categorized only 67% of the occurrences. The usefulness of this finding lies in considering local pitch change as one of many redundant cues. For example, the direction of pitch change could serve as a confirming cue when analyzing ambiguous or erroneous recognizer output.
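One way to picture pitch direction as a confirming cue is a simple Bayesian update of the recognizer's belief. The sketch below is an illustration under stated assumptions, not the integration theory proposed here: the function name is hypothetical, and the reliability default of 0.67 simply reuses the overall accuracy figure as if it were a per-class likelihood, which the reported result does not establish.

```python
def update_with_pitch(p_ack, pitch_falling, reliability=0.67):
    """Combine a recognizer's belief with the pitch-direction cue.

    p_ack         -- prior probability (from other cues) that the word
                     is an acknowledgment rather than a direction
    pitch_falling -- True if the pitch tracker reports falling pitch
    reliability   -- assumed probability that pitch direction matches
                     the usage (illustrative value only)
    """
    if not 0.0 <= p_ack <= 1.0:
        raise ValueError("p_ack must be in [0, 1]")
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be in (0, 1)")
    # Likelihood of the observed pitch direction under each usage.
    like_ack = reliability if pitch_falling else 1.0 - reliability
    like_dir = 1.0 - reliability if pitch_falling else reliability
    numerator = p_ack * like_ack
    return numerator / (numerator + (1.0 - p_ack) * like_dir)


if __name__ == "__main__":
    # An undecided recognizer is nudged toward acknowledgment.
    print(update_with_pitch(0.5, pitch_falling=True))
    # A fairly confident recognizer is confirmed by falling pitch.
    print(update_with_pitch(0.8, pitch_falling=True))
```

Because the cue is weak, it shifts an ambiguous hypothesis only modestly, which matches its intended role as one of several redundant cues.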

I am now expanding the prosodic study to encompass other acknowledgment acts and to correlate the observed values of the cues identified above to the presence or absence of acknowledgments. The data for this study, as for the prosody study, are drawn from the Vehicle Navigation (VN) corpus. This is a collection of 93 brief task-oriented human-human dialogues taking place over cellular telephone. There are approximately 1100 occurrences of acknowledgment speech acts in this corpus.

Stage 2: Assessing the Theory

To assess the usefulness of my theory, I will test it in the context of a working system with a spoken language interface. Such a system, a scheduling application, is currently under development as part of a separate project in our lab. As in the VN corpus, interaction will be via telephone, and many verbal acknowledgments are expected to occur. The situation differs from the VN task in that interaction will be human-computer instead of human-human and the specific task will be different, so I expect it to provide a good test of the generality and applicability of the theory.

Assessment of the theory will be based on a comparison of the system with and without the acknowledgments theory developed in Stage 1. I will use a within-subject design, with each subject using both systems to complete several scheduling tasks. I anticipate that a test suite will be developed for evaluating the original system; I plan to draw test cases from that suite. My metrics are similar to those proposed by Goodine et al. [2]:

A second experiment will probe details of the theory. Subjects will complete several scheduling tasks using versions of the system in which one or more of the cues are ignored. Metrics, tasks and experimental design will be as in the previous experiment. An important result expected from this experiment will be estimates of the robustness of the various cues under realistic conditions.
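The system versions for this second experiment can be enumerated mechanically. The sketch below lists every condition in which one or more of four cues is ignored; the cue names are placeholders, and testing all combinations is an assumption made for illustration, since the text does not say which subsets will actually be run.

```python
from itertools import combinations

# Placeholder names for the four cues (hypothetical labels).
CUES = ("cue_1", "cue_2", "cue_3", "cue_4")


def ablation_conditions(cues=CUES):
    """Return every condition in which at least one cue is ignored."""
    conditions = []
    for k in range(1, len(cues) + 1):
        for ignored in combinations(cues, k):
            active = tuple(c for c in cues if c not in ignored)
            conditions.append({"ignored": ignored, "active": active})
    return conditions


if __name__ == "__main__":
    for condition in ablation_conditions():
        print(condition)
```

With four cues this yields 15 conditions, including the one in which all four cues are ignored; a real study would likely select a subset to keep the number of sessions per subject manageable.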

References

1. Cole, R. A., Hirschman, L. et al. (1992). "Workshop on Spoken Language Understanding," Oregon Graduate Institute Technical Report No. CS/E 92-014.

2. Goodine, D., Hirschman, L., Polifroni, J., Seneff, S., & Zue, V. (1992). "Evaluating Interactive Spoken Language Systems," Proceedings of the 1992 International Conference on Spoken Language Processing (ICSLP 92), pp. 197-200.

3. Issar, S. & Ward, W. (1993). "CMU's Robust Spoken Language Understanding System," Eurospeech '93, pp. 2147-2150.

4. Nakajima, S. & Allen, J. (1993). "A Study on Prosody and Discourse Structure in Cooperative Dialogues," Rochester Tech Report No. TRAINS-TN93-2, Sept. 1993.

5. Novick, D. G. & Sutton, S. (1994). "An Empirical Model of Acknowledgment for Spoken-Language Systems," in Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 96-101.

6. Pierrehumbert, J. & Hirschberg, J. (1990). "The Meaning of Intonational Contours in the Interpretation of Discourse," in Intentions in Communication, P. Cohen, J. Morgan, & M. Pollack (Eds.), Chapter 14, pp. 271-311, Cambridge, MA: MIT Press.

7. Price, P., Ostendorf, M., Shattuck-Hufnagel, S., & Fong, C. (1991). "The Use of Prosody in Syntactic Disambiguation," in Proceedings of the Fourth DARPA Workshop on Speech and Natural Language, Patti Price (Ed.).

8. Seneff, S. (1992). "TINA: A Natural Language System for Spoken Language Applications," Computational Linguistics, Vol 18, No. 1, pp. 61-86.

9. Wang, M. Q. & Hirschberg, J. (1992). "Automatic Classification of Intonational Phrase Boundaries," Computer Speech and Language, Vol. 6, pp. 175-196.

10. Ward, K. & Novick, D. G. (1994). "On the Need for a Theory of Integration of Knowledge Sources for Spoken Language Understanding." Proceedings of the AAAI-94 Workshop on the Integration of Natural Language and Speech Processing, July 1994, pp. 23-30.

11. Ward, K. & Novick, D. G. (1995). "Prosodic Cues to Word Usage." to appear in ICASSP-95.

12. Young, S. & Ward, W. (1993). "Semantic and Pragmatically Based Re-Recognition of Spontaneous Speech," Eurospeech '93, pp. 2243-2246.