



As spoken language interfaces for real-world systems
become a practical possibility, it has become apparent that
such interfaces will need to draw on a variety of cues from
diverse sources to achieve a robustness and naturalness
approaching that of human performance [1]. However, our
knowledge of how these cues interact in the aggregate is
still sketchy. We lack a strong theoretical basis
for predicting which cues will prove useful in
practice and for specifying how these cues should be combined
to signal or cancel out potential interpretations of the
communicative signal. In the research program summarized
here, I propose to develop and test an initial theory of cue
integration for spoken language interfaces.
Recently we have seen an increase in research probing specific
relationships between some of the knowledge sources
used in computational spoken language understanding; a
brief review may be found in [10].
Although several studies
have shown relationships between pairs of various potential
cues, none have attempted to study more complex interactions.
Furthermore, these findings have not been applied to
working systems. In this research program we are studying
the contributions and interrelationships of four cues in the
recognition of acknowledgments in spontaneous dialogue:
Current systems rely primarily on lexicalization to signal
speaker intention, with the context of the preceding utterance
providing additional constraints (e.g., [3],
[8], [12]).
Pause length is a strong marker for syntactic structure in
professionally read speech ([7],
[9]). We lack computational
models for understanding pause cues in spontaneous speech,
however; existing systems simply ignore pause. Pitch
changes offer additional cues about the speaker's intentions.
Pierrehumbert and Hirschberg [6]
proposed that phrasal
tunes signal relationships between the propositional content
and the mutual beliefs of the participants. More specifically,
Nakajima and Allen [4]
examined the relationship between
fundamental frequency (F0) and discourse structure in spontaneous
task-oriented dialogue and found that F0 values
tend to signal topic shift and topic continuation across pause
boundaries. Pitch accents mark salient material
[6], which
may be useful not only in interpreting the intention behind
the utterance but also in locating critical content words for
recognition purposes.
This research has two parts. In the first stage, now underway,
my goal is to establish a basis for understanding cue
interrelationships in task-oriented, mixed-initiative spontaneous
conversation. Because recognition of the speech act
motivating an utterance is central to formulating a helpful
response, my Stage 1 work is based on a speech-act level
analysis of a corpus of human-human conversations. I am
now forming an initial theory relating speech act recognition
to the four cues identified above; some early, partial results
are reported in the next section. In the second stage of this
study, I will test and refine my theory by implementing it in
the context of a working system.
To keep the scope of this study manageable, I focus my inquiry in
two ways. First, I consider only a single type of speech act,
the act of acknowledgment. Acknowledgments are used
heavily in many types of task-oriented conversations to
coordinate turn-taking and signal understanding
[5], so systems
must be able to recognize them in a reliable fashion to
achieve robust behavior in mixed-initiative interaction. Second,
I limit my inquiry to a small number of cues. I do not
assert that these cues are the only ones present in the speech
signal, nor even that they are the only important ones. Nonetheless,
understanding their interrelationships will enable
me to establish the utility of my method and to form an initial
theory that can be expanded and refined in later work.
Furthermore, I expect cues such as these to be of practical
use in spoken language interfaces because they are available
and relatively robust in existing systems. In a system
expected to participate in real-time conversational interaction,
it will be important to exploit low-level cues that are
robust and fast so that slower and more complex analysis
can be reserved for those inputs that require it.
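
The division of labor described above, with fast cues tried first and slower analysis held in reserve, can be sketched as a simple cascade. This is a minimal illustration under stated assumptions, not a description of any existing system: the word list, the function names, and the stand-in for the slower analysis are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical word list; a real system would derive its fast cues from data.
ACK_WORDS = {"okay", "right", "uh-huh", "yeah"}


def fast_cue_check(utterance: str) -> Optional[str]:
    """Cheap lexical test: return a label when the cue is decisive, else None."""
    token = utterance.strip().rstrip(".!?").lower()
    return "acknowledgment" if token in ACK_WORDS else None


def full_analysis(utterance: str) -> str:
    """Stand-in for a slower, more complete interpretation step (assumed)."""
    return "needs-full-interpretation"


def interpret(
    utterance: str,
    fast: Callable[[str], Optional[str]] = fast_cue_check,
    slow: Callable[[str], str] = full_analysis,
) -> str:
    """Try the fast cue first; fall back to the slow analysis only if needed."""
    label = fast(utterance)
    return label if label is not None else slow(utterance)
```

The point of the design is only that the expensive path runs when the cheap path abstains; which cues are cheap and reliable enough to go first is exactly what the study is meant to establish.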
In the work completed to date, we examined prosodic characteristics
of a word used in several distinct senses, one
sense being acknowledgment. Our results indicate that intonation,
as reported by a pitch tracker, can help disambiguate the
senses of a homonym such as the
word "right" [11]. We examined the pitch patterns of 57
utterance-initial occurrences of the word "right" and found a
significant difference in pitch change between usages (p = 0.0375). When
"right" was used as an acknowledgment or answer (e.g. A:
"Turn left again. heading north." B: "Right."), it was more
likely to be pronounced with a falling intonation. When used
as a direction (e.g. "Right on Main street"), "right" was
more likely to occur with a rising intonation.
We did not find pitch change alone to be an adequate discriminator
of word usage; if used as the sole cue, it correctly
categorized only 67% of the occurrences. The usefulness of
this finding lies in considering local pitch change as one of
many redundant cues. For example, the direction of pitch
change could serve as a confirming cue when analyzing
ambiguous or erroneous recognizer output.
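
Using pitch direction as one confirming cue among several can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the function names, the F0 sample format, the first-to-last-sample comparison, and the adjustment weight are all assumptions. The only element taken from the text is the reported tendency (falling pitch for acknowledgment, rising pitch for direction).

```python
from typing import Dict, List

# Tendencies reported in the text for utterance-initial "right".
EXPECTED_DIRECTION = {"acknowledgment": "falling", "direction": "rising"}


def pitch_direction(f0_hz: List[float]) -> str:
    """Label the local pitch change over a word as falling, rising, or level.

    f0_hz holds voiced F0 samples in Hz from a pitch tracker. Comparing the
    first and last sample is a simplifying assumption.
    """
    if len(f0_hz) < 2:
        return "level"
    delta = f0_hz[-1] - f0_hz[0]
    if delta < 0:
        return "falling"
    if delta > 0:
        return "rising"
    return "level"


def rescore(
    hypotheses: Dict[str, float], f0_hz: List[float], weight: float = 0.1
) -> Dict[str, float]:
    """Nudge recognizer scores toward the sense that matches the observed pitch.

    `weight` is an arbitrary illustrative value, not an estimate from the corpus.
    """
    observed = pitch_direction(f0_hz)
    return {
        sense: score + (weight if EXPECTED_DIRECTION.get(sense) == observed else 0.0)
        for sense, score in hypotheses.items()
    }


# Two equally scored hypotheses; a falling contour tips the balance.
scores = rescore({"acknowledgment": 0.5, "direction": 0.5}, [180.0, 165.0, 150.0])
best = max(scores, key=scores.get)
```

Because pitch change alone categorized only 67% of occurrences correctly, the sketch adjusts existing scores rather than overriding them.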
I am now expanding the prosodic study to encompass other
acknowledgment acts and to correlate the observed values
of the cues identified above to the presence or absence of
acknowledgments. The data for this study, as for the prosody
study, are drawn from the Vehicle Navigation (VN) corpus.
This is a collection of 93 brief task-oriented human-human
dialogues taking place over cellular telephone. There
are approximately 1100 occurrences of acknowledgment
speech acts in this corpus.
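
Relating observed cue values to the presence or absence of acknowledgments amounts, at its simplest, to a cross-tabulation. The sketch below shows one way this might be set up. The function names and data layout are assumptions, and the observations are made-up toy values, not corpus data.

```python
from collections import Counter
from typing import Iterable, Tuple


def cross_tabulate(observations: Iterable[Tuple[str, bool]]) -> Counter:
    """Count (cue_value, is_acknowledgment) pairs."""
    return Counter(observations)


def proportion_ack(table: Counter, cue_value: str) -> float:
    """Fraction of utterances with this cue value that are acknowledgments."""
    ack = table[(cue_value, True)]
    total = ack + table[(cue_value, False)]
    return ack / total if total else 0.0


# Toy observations for illustration only; these are NOT drawn from the VN corpus.
toy = [("falling", True), ("falling", True), ("falling", False), ("rising", False)]
table = cross_tabulate(toy)
falling_rate = proportion_ack(table, "falling")
```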
To assess the usefulness of my theory, I will test it in the
context of a working system with a spoken language interface.
Such a system, a scheduling application, is currently
under development as part of a separate project in our lab.
As with the VN corpus, interaction is via telephone and many
verbal acknowledgments are expected to occur. The situation
differs from the VN task in that interaction will be human-computer
instead of human-human and the specific task will
be different, so I expect it to provide a good test of the generality
and applicability of the theory.
Assessment of the theory will be based on a comparison of
the system with and without the acknowledgments theory
developed in Stage 1. I will use a within-subject design,
with each subject using both systems to complete several
scheduling tasks. I anticipate that a test suite will be developed
for evaluating the original system; I plan to draw test
cases from that suite. My metrics are similar to those proposed
by Goodine et al. [2]:
A second experiment will probe details of the theory. Subjects
will complete several scheduling tasks using versions
of the system in which one or more of the cues are ignored.
Metrics, tasks and experimental design will be as in the previous
experiment. An important result expected from this
experiment will be estimates of the robustness of the various
cues under realistic conditions.
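
The system versions for this second experiment can be enumerated mechanically. The sketch below lists every condition in which one or more cues are ignored while at least one remains active. The four cue labels are placeholders inferred from the surrounding discussion, not a list taken from the study, and the function name is hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence

# Placeholder cue labels (assumed); substitute the cues actually under study.
CUES = ("lexical", "context", "pause", "pitch")


def ablation_conditions(cues: Sequence[str] = CUES) -> List[FrozenSet[str]]:
    """Return each set of cues to ignore, from one cue up to all but one."""
    conditions = []
    for k in range(1, len(cues)):
        for ignored in combinations(cues, k):
            conditions.append(frozenset(ignored))
    return conditions


conditions = ablation_conditions()
```

With four cues this gives 14 conditions (4 + 6 + 4), which may be more than a within-subject design can comfortably cover; restricting the experiment to single-cue ablations would reduce it to four.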
11. Ward, K. & Novick, D. G. (1995). "Prosodic Cues to
Word Usage." To appear in ICASSP-95.
Abstract
Keywords:
Spoken language interfaces
Introduction
STUDY
Stage 1: Understanding Acknowledgments
Stage 2: Assessing the Theory
References