Brian Hansen, David G. Novick, Stephen Sutton
ABSTRACT
Designers of system prompts for interactive spoken-language systems
typically seek 1) to constrain users so that they say things that the
system can understand accurately and 2) to produce ``natural''
interaction that maximizes users' satisfaction. Unfortunately, these
goals are often at odds.
Keywords
Interaction design, auditory I/O, dialog analysis, design techniques,
evaluation, toolkits
INTRODUCTION
Between the attainable practicality of command-based speech
recognition and the elusive attraction of ``natural'' spoken-language
interaction lies the growing use of spoken dialogue systems (SDSs).
This middle ground includes applications such as the AT&T
long-distance billing system, the OGI automated spoken questionnaire
for the U.S. Census [6], and systems performing the ATIS travel
information task [15]. These systems engage in relatively simple
task-based dialogues, often expecting users' utterances to consist of
a single word or a short phrase; they are analogous in complexity to
graphical user interfaces (GUIs). Like GUIs, SDSs generally do not
generate their output at run time; they instead use pre-specified
phrases or templates as prompts. Development of these prompts is
usually taken to be more art than science; to create the system's
prompts, designers most often rely on expert intuition and tacit
experience. But beyond intuition and experience we propose systematic
methods for characterizing and generating spoken-dialogue prompts. In
this paper, we present these methods and show their usefulness for
developing effective SDSs. Although we concentrate on SDS development,
these methods have a natural extension to the speech component of
multimedia systems.
THE PROBLEM
Current speech recognition algorithms match the features of a speech
signal with models of the features of known phonemes via a statistical
process. One effect of this statistical matching is that recognition
is probabilistic. In a GUI, only the relevant user actions are
defined at any given moment. This is equivalent to imposing a
vocabulary of legal actions. It is generally not possible to enforce
the same rigid constraints in a spoken interface. Unfortunately, the
need for a high degree of recognition accuracy in speaker-independent
speech recognition imposes the requirement that the words to be
recognized come from a relatively small set of candidates. The overall
effectiveness [8] of a SDS, then, is dependent upon the ability of
dialogue designers to produce prompts that constrain users' possible
responses. One of the subtleties of dialogue design lies in giving
users a feeling of naturalness and freedom of response although
underlying constraints exist.
HEURISTICS, DIMENSIONS AND STYLES
To advance the production of voice-response questionnaires from an ad
hoc, mostly intuitive ``craft'' into more of an engineering
discipline, we have developed a method using a set of heuristics for
transforming [16] a written version of a questionnaire into a script
(or protocol) for use with speech-recognition systems. From these
heuristics, we then developed a systematic approach to the design of
spoken prompts; this approach is based on defining a space of possible
system prompts that can be described by a set of task-independent
descriptive dimensions. We identify a set of fourteen dimensions of
system prompts, and define a point in the space they form as a
``style'' for prompts.
Heuristics for Designing Dialogues
In designing prompts for the census task, we quickly saw that for each
question there was a myriad of ways of expressing its underlying
intent. To converge on a practical number of dialogue designs to test,
we needed a principled way of deciding which, out of thousands of
wording variations, we should use. One limiting factor for this
particular project derived from the nature of the census task itself:
we needed the spoken questionnaire to be as true to the original
written form as possible in order to avoid distorting census results.
This concern led us to examine ways of transforming the original
written questionnaire into a form suitable for use with a SDS. One
result of our investigation was a set of heuristics for translating
from written to spoken media.
In our efforts, these heuristics have been useful for reducing the expected user vocabulary, reducing the effects of user intonation, mitigating the system's reduced level of understanding and interactive abilities, and compensating for the loss of visual access to the written form (including the ability to scan ahead) [3]. Perhaps the greatest benefit of this framework is that it encourages an empirical approach to dialogue design. By making and testing predictions about the effects of various styles, we can reject inappropriate dialogue styles and reduce the dialogue designer's reliance on intuition and hand-crafting.
The following sections describe individual heuristics we devised in the course of the census project. In many cases the styles associated with a heuristic are presented as hypothetical interactions between the system (``S'') and a user (``U'').
Style 1.1: S: What is your home phone number (including area code)?
U: I don't have a phone.
Style 1.2: S: Do you have a home phone number?
U: Yes.
S: What is your home phone number (including area code)?
U: <telephone number>
Alternately, style 1.2 employs a ``guard question'' that reduces the difficulty of interpreting a response where the precondition does not hold. It also increases the length of the interaction, though it may be relevant for only a fraction of the cases encountered. In those cases, guard questions may reduce the chance of communication breakdown. In the context of a large number of yes/no questions, however, style 1.2 could become tedious for users.
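The control flow of a guard question can be sketched in a few lines of code. This is only an illustration: the `ask` callback, which is assumed to play a prompt and return the recognized response as text, and the helper `scripted` are hypothetical names and not part of the census system.

```python
from typing import Callable, Optional

def ask_phone_guarded(ask: Callable[[str], str]) -> Optional[str]:
    """Style 1.2: ask a yes/no guard question before the main question.

    `ask` is a hypothetical callback that plays a prompt and returns the
    recognized user response as text.
    """
    # The guard question expects only "yes" or "no", a small vocabulary.
    if ask("Do you have a home phone number?").strip().lower() != "yes":
        return None  # Precondition does not hold; skip the main question.
    return ask("What is your home phone number (including area code)?")

def scripted(responses):
    """Return an `ask` stand-in that replays canned responses in order."""
    it = iter(responses)
    return lambda prompt: next(it)

print(ask_phone_guarded(scripted(["No"])))
print(ask_phone_guarded(scripted(["Yes", "503 690 1121"])))
```

The extra turn is paid by every user, while only users without a phone benefit from it, which is the trade-off described above.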
Style 2.1: S: What is your home phone number?
U: 503...um... 690...
Style 2.2: S: We need to know your home phone number.
S: What is the area code?
U: 503
S: and the number?
U: <tel number>
Style 2.3: S: We need to know your home phone number.
S: What is the area code?
U: 503
S: and the exchange?
U: 226
S: and...
Style 2.4: S: We need to know your home phone number.
S: Please state the area code <pause> 3-digit exchange, and 4-digit...
U: 503 226 2...
Style 2.5: S: We need to know your home phone number. Please state your number, area code first.
U: 503
S: Mmm-hmm
U: 690
S: Yes.
U: 1121
S: 1121. Ok.
Style 2.4, like style 2.3, specifies each component the user is
expected to provide, sharing with style 2.3 the danger of confusing
people unfamiliar with the notion of a telephone ``exchange'' or, more
generally, the names of the individual components. Style 2.4
encourages the user, however, to supply all components within a single
turn at speech. The need to forestall extended repair sub-dialogues
may require that the system offer acceptances [4] of users' utterances
after the components of multi-part answers are received. Style 2.5
depicts such a case in which the system provides feedback in the form
of acknowledgments and echoing [2,10].
Questions Involving Choice Between Two Options
The heuristic depicted in Figure 3 describes the different ways of
asking questions where only two responses are expected (for example
``Are you male or female?''). Style 3.1 invites uncooperative users to
answer ``Yes'' or ``No'', especially if minimal or non-intuitive
intonation is used in presenting the question. This may require a
clarifying repair sub-dialogue perhaps employing a style-3.2 type
interaction.
Style 3.1: S: A or B?
U: B
Style 3.2: S: A?
U: No.
S: B?
U: Yes.
Style 3.3: S: A?
U: No.
S: Then B, correct?
U: Yes/No.
Figure 3: Questions involving choice between two options
Style 3.2 increases the number of interactions required in the average
case, consequently increasing the survey time overall. Further, if the
two options are truly mutually exclusive, users, recognizing the
overall intent of the series of questions, may volunteer the answer to
the underlying question (e.g., U: ``No, I'm a B.''), or worse (U: ``If
I said I wasn't female, then what else could I be but male?''). In
both cases the variety and complexity of expressions that must be
recognized are greatly increased.
Questions Involving Choice Among Three to Six Options
Figure 4 depicts a heuristic for multiple choice questions having more
than two but still only a few alternatives. We judge that among the
different treatments, style 4.1 is somewhat less natural than styles
4.2 and 4.3. This is especially true for questions having
stereotypical answers (e.g., ``What's your marital status?''
``Single''). It is slightly less natural than Style 4.2, because a
human operator can compensate for the user's not mentioning an option
name directly and can either interpret a response as indicating a
category, or can move toward a Style 4.3 interaction if necessary.
While style 4.1 may be expected to elicit more constrained responses,
it may suggest that the user cannot be trusted to recognize the
choices, an indication that may appear to be insulting or
condescending if obvious choices are spelled out.
Style 4.1: S: <ask question, give options>
U: <option-name>
Style 4.2: S: <ask question without giving options>
U: <option-name>
Style 4.3: <transform question into series of sub-questions
(a decision tree) having yes/no answers>
Style 4.4: <for each option, ask if it is the case>
Style 4.5: Similar to style 4.3, except that when the number of options
is reduced to 2-3, ask for the option-name
Figure 4: Questions involving choice among three to six options
Questions Involving Choice Among More Than Six Options
The analysis here is similar to that for styles presented in the
previous section except that with more choices the problems become
more severe. Use of style 4.1 for more than six options may put a
severe strain on the user's short-term memory, while style 4.2 may
leave the user even more adrift as to what exactly constitutes a
proper answer. The decision tree of style 4.3 becomes deeper, though
not so quickly as the option-checking sequence of style 4.4, which
becomes clearly unnatural as the number of options increases.
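The difference in growth between styles 4.3 and 4.4 can be made concrete with a little arithmetic. As a rough sketch, assume a balanced binary decision tree for style 4.3 and one yes/no question per option in the worst case for style 4.4. Both are simplifying assumptions made here for illustration, not properties of the census dialogues.

```python
import math

def tree_questions(n_options: int) -> int:
    """Worst-case yes/no questions for a balanced binary decision tree."""
    return math.ceil(math.log2(n_options))

def checking_questions(n_options: int) -> int:
    """Worst-case questions when each option is asked about in turn."""
    return n_options

# Columns: number of options, style 4.3 questions, style 4.4 questions.
for n in (3, 6, 12, 24):
    print(n, tree_questions(n), checking_questions(n))
```

Under these assumptions the tree grows logarithmically and the option-checking sequence linearly, so the gap widens quickly beyond six options.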
Style 5.1: <reduce problem to fewer options and include
``other'', then use more choice-constrained
heuristics; in the case of ``other'', either store
what the user says for later interpretation, or ask
the same question with the next group of options>
Style 5.2: S: <ask question, give explanation of n-at-a-time
style, loop through the options n at a time>
U: <option name, or special phrases for user
initiated repair>
Figure 5: Questions involving choice among more than six options
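The n-at-a-time loop of style 5.2 can be sketched as follows. The prompt wording, the keyword 'next', and the placeholder option names are all hypothetical; they illustrate the grouping only and are not taken from the census questionnaire.

```python
from typing import Iterator, List

def option_groups(options: List[str], n: int) -> Iterator[List[str]]:
    """Yield successive groups of at most n options."""
    for start in range(0, len(options), n):
        yield options[start:start + n]

def group_prompt(group: List[str], more_follow: bool) -> str:
    """Build one prompt for a group; the wording is illustrative only."""
    listed = ", ".join(group)
    tail = ", or say 'next' to hear more" if more_follow else ""
    return f"Please say one of: {listed}{tail}."

options = [f"option {i}" for i in range(1, 9)]  # placeholder option names
groups = list(option_groups(options, 3))
prompts = [group_prompt(g, i < len(groups) - 1) for i, g in enumerate(groups)]
for p in prompts:
    print(p)
```

Smaller values of n reduce the load on the user's memory at the cost of more turns.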
Encouraging Brief Answers
Figure 6 shows three different styles for eliciting brief, concise
answers. Of these, style 6.1 is quick and formal, though not
particularly ``friendly,'' and is likely to evoke a reasonably
focussed response. Style 6.2 takes longer but is likely to elicit
fewer open-ended responses. It is also likely to be frustrating for
expert users. Style 6.3 is most natural in presentation but does
little to constrain the response. Style 6.3 might require increasing
the coverage of grammar to accommodate more verbose or non-standard
responses, thereby decreasing recognition accuracy.
Style 6.1: Give ``telegraphic'' questions. For example,
S: Date of birth?
Style 6.2: Explicitly state what information is wanted, and what
form it should take as a parenthetical to the
question. For example,
S: ``We now ask about your date of birth. Please say
the month, the day and then the year of your birth.''
Style 6.3: Phrase question ``naturally'' and hope user provides a
short, appropriate response. For example,
S: ``What is your date of birth?''
Figure 6: Encouraging brief answers
Other Heuristics
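In this section we briefly describe some additional heuristics that
serve to illustrate the breadth and utility of this approach. In
particular, we sketch the expected trade-offs among alternative
approaches to turn-taking, prior explanations, the system's voice,
the system's persona, the rate of speech, and the presentation of
information to users.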
It is difficult, using current speech recognition methods, to accurately gauge when a user has finished his or her turn at speech. Moreover, it is difficult to provide timely feedback to the user as to whose turn it is. We have identified at least three possible implementations of turn-taking. If the system employs ``natural'' intonation patterns to signal end-of-turn, it may encourage users to encode information in intonation, possibly causing misunderstanding. If it relies only on illocutionary expectations, the dialogue may be vulnerable to communication breakdowns following turn confusion. If it uses beeps or other tone patterns to indicate turn completion, it may require some explanation to the user, increasing the number of utterances made by the system.
For questions that require prior explanations, there are two general styles: 1) provide as short an explanation as possible, or 2) provide longer explanations. Longer or more frequent explanatory text describing the intent of the question or the form of the expected answer tends to increase the output time and the output vocabulary. Increasing the output vocabulary may serve to entrain users into believing the system is able to recognize a large vocabulary, leading them to use out-of-vocabulary keywords or complex grammatical constructs.
In the case of the system's voice (either recorded human or computer synthesized), we expect to find that users react negatively to the use of synthesized speech. Not only is such speech not ``natural,'' but it is often difficult for human listeners to understand. We expect, however, that users provide more concise answers when prompted by a synthesized voice. In the course of developing the census system, this heuristic was tested [11] with mixed results.
Related to system voice is the choice of the persona within which the system interacts [9]. Although we make no clear prediction as to the effects of varying the persona on speech recognition accuracy, the choice of persona may affect users' acceptance of the system. Different personas in our case included the government, a census taker, or a spokesperson. In the census project, the system persona was an anonymous census enumerator.
An area not explicitly tested in the census project was the rate of speech of the system voice. On one hand, we predict that faster speech may be more compelling but may entrain users to speak faster in response, possibly degrading speech recognition accuracy. A slower rate of speech, on the other hand, may increase user frustration and lead users to interrupt (or ``barge in on'') the system voice, again degrading recognition accuracy.
Finally, as the census project was concerned primarily with asking questions, we did not develop extensive heuristics addressing how best to convey information to, or answer questions of, the users. Where the objective is primarily to convey information to the user, the quickest style for presenting information would be simply to present it and go on to the next stage of the dialogue. If it were critical that the information be understood, the system might ask for confirmation and go on if confirmed. Alternately, if the system detected silence or sounds indicating that the user was uncertain or did not understand, it could present the information again or inquire as to possible sources of misunderstandings.
By revisiting the various styles for each of the heuristics, we identified ten dimensions characterizing the phrasing of system prompts. In addition, we identified a number of dimensions characterizing the interaction as a whole. In total, these dimensions define a fourteen-dimensional space of system prompts.
Each dimension may be thought of as naming a way of varying a system prompt. The dimension PreExplanation, for example, denotes the degree to which the intent behind the prompt is described to the user before the question is actually given. Although in this case, as in many of the other dimensions, a whole continuum could be imagined, we often limited our analysis to polar opposites (e.g., +PreExplanation and -PreExplanation). In other cases, such as Decomposition, ordering the points within the dimension was less clear.
Style 1: (+Terse, -PreExplanation, -ListOptions)
S: Marital status?
Style 2: (+Terse, -PreExplanation, +ListOptions)
S: Marital status? Now married, widowed, divorced, separated, or never married?
Style 3: (-Terse, -PreExplanation, +ListOptions)
S: What is your marital status, now married, widowed, divorced, separated, or never married?
Style 4: (+PartialDecisionTree, -Terse, +GiveOptions, -PreExplanation)
S: Are you now married (yes or no)?
if no, then S: Have you ever been married (yes or no)?
if yes, then S: Were you widowed, divorced or separated (please say one)?
Style 5: (+PreExplanation, +ListOptions, -Terse, +GiveOption)
S: The next question will determine your marital status. The categories are: now married, widowed, divorced, separated, and never married. What is your marital status?
One of the advantages of using styles defined in terms of features is that it allows us to characterize the overall style of the interaction rather than limiting our analysis to identifying the style of a single prompt. We thus define overall stylistic consistency of a SDS as the property of a dialogue in which the styles associated with each prompt do not conflict.
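One way to make this definition operational is to represent each prompt's style as a mapping from dimension names to polarities and to flag any dimension on which two prompts disagree. The sketch below uses the feature lists of Styles 1 and 2 from the marital-status example. Treating a ``conflict'' as opposite polarities on a shared dimension is an assumption made here for illustration, and the function names are hypothetical.

```python
from typing import Dict, List, Set

Style = Dict[str, bool]  # dimension name -> True for "+", False for "-"

def parse_style(spec: str) -> Style:
    """Parse a feature list such as '+Terse, -PreExplanation'."""
    style: Style = {}
    for feature in spec.split(","):
        feature = feature.strip()
        style[feature[1:]] = feature[0] == "+"
    return style

def conflicting_dimensions(styles: List[Style]) -> Set[str]:
    """Return dimensions given both '+' and '-' across the prompts."""
    seen: Dict[str, bool] = {}
    conflicts: Set[str] = set()
    for style in styles:
        for dim, polarity in style.items():
            # setdefault records the first polarity seen for a dimension.
            if seen.setdefault(dim, polarity) != polarity:
                conflicts.add(dim)
    return conflicts

style_1 = parse_style("+Terse, -PreExplanation, -ListOptions")
style_2 = parse_style("+Terse, -PreExplanation, +ListOptions")
print(sorted(conflicting_dimensions([style_1, style_2])))
```

Under this reading, a dialogue is stylistically consistent when the set of conflicting dimensions is empty.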
The iterative approach required us to produce a method for assessing the merit of each design as a basis for further refinement. We addressed this problem from two perspectives: accuracy of recognition and naturalness of interaction. To evaluate our dialogue designs we used an objective measure of the conciseness of users' responses in combination with a subjective measure of naturalness as reflected in users' feedback to evaluation questions. Together these metrics supplied grounds for making a wide range of dialogue design decisions, including evaluating candidate styles. In addition, these evaluation metrics provided a means to test the predictions made by our heuristics. These predictions effectively narrowed the search space of subsequent prompt refinements.
We now briefly present a behavioral coding scheme (BCS), a subjective evaluation metric, and some results from using our approach to dialogue development for the census system.
Response Class         Description                                                System prompt                      User response
Adequate Answer 1      Answer is concise and responsive.                          Have you ever been married?        Yes
Adequate Answer 2      Answer is usable but not concise.                          Have you ever been married?        No I haven't
Adequate Answer 3      Answer is responsive but not usable.                       Have you ever been married?        Unfortunately
Inadequate Answer 1    Answer does not appear to be responsive.                   What is your sex, female or male?  Neither
Inadequate Answer 2    User says nothing at all.                                  What is your sex, female or male?  <silence>
Qualified Answer       User expresses uncertainty.                                What year were you born?           Nineteen fifty five I think
Req for Clarification  User requests clarification of the meaning of a question.  Are you black, white or other?     What do you mean?
Interruption           User interrupts the speaking of the question.              What year were you born?           *teen fifty five
Don't Know             User responds ``I don't know'' or equivalent.              Are you black, white or other?     I'm not sure
Refusal                User refuses to answer.                                    What year were you born?           I'm not telling you
Other                  User behavior not captured by the above codes.             What year were you born?           Thirt... <noise>
Table 1: Summary of behavioral coding scheme
The BCS can characterize a set of utterances; the distribution of BCS
codes associated with responses to a given question in different
treatments, or regions of the space of dialogue prompts, can be used
as a basis for evaluation. For example, suppose we have three
candidate prompt styles and wish to select the one that is the most
constraining. First, we collect data for the three prompt styles, then
label these data according to the BCS. Comparing the frequency of
class ``Adequate Answer 1'' for the three styles shows which style
elicited the most constrained responses.
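The comparison just described amounts to computing, for each style, the fraction of responses coded ``Adequate Answer 1'' and choosing the largest. A minimal sketch follows. The style names and labels are made up for illustration and are not data from the census study; the function names are likewise hypothetical.

```python
from collections import Counter
from typing import Dict, List

def code_frequency(labels: List[str], code: str) -> float:
    """Fraction of responses labelled with the given BCS code."""
    if not labels:
        return 0.0
    return Counter(labels)[code] / len(labels)

def most_constraining(labelled: Dict[str, List[str]],
                      code: str = "Adequate Answer 1") -> str:
    """Return the style whose responses most often received `code`."""
    return max(labelled, key=lambda s: code_frequency(labelled[s], code))

# Made-up labels for three hypothetical styles; not data from the study.
labelled = {
    "style A": ["Adequate Answer 1", "Adequate Answer 2",
                "Interruption", "Adequate Answer 1"],
    "style B": ["Adequate Answer 1", "Adequate Answer 1",
                "Adequate Answer 1", "Qualified Answer"],
    "style C": ["Adequate Answer 2", "Refusal",
                "Adequate Answer 1", "Don't Know"],
}
print(most_constraining(labelled))
```

The same frequency function can be applied to any other code, for example to compare how often each style provokes interruptions.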
Subjective Evaluation Questions
Given the potential trade-off between recognition accuracy and
naturalness of interaction, reliance on the BCS as our sole criterion
when designing prompts might lead us to dialogues that were very
effective from the standpoint of eliciting highly recognizable
responses but rather awkward or frustrating for users. We balance the
behavioral coding evaluation of prompt styles with feedback from
users. We solicited this feedback through evaluation questions
presented at the end of the questionnaire providing users the
opportunity to express their likes and dislikes regarding any aspect
of the dialogue, including question topics, the wording of prompts,
and the manner in which the prompts were presented.
Results of Evaluations
We used the behavioral coding scheme (and its predecessor versions),
task completion rates, and responses to evaluation questions in three
formal rounds of dialogue development in the Census project. The first
round (based on roughly 100 callers) involved comparisons of the
strongest differences among three overall styles. Our evaluation
enabled us to pursue only those designs that elicited constrained
answers and were generally acceptable to users.
DEPLOYMENT OF STYLES IN CSLURP TOOLKIT
The styles developed in the census project are proving useful in a
broad range of applications. As part of on-going research in SDSs we
have incorporated the notion of dialogue style into a toolkit [1] for
creating spoken-language applications. This toolkit provides
state-of-the-art speaker- and vocabulary-independent spoken-language
recognition technology allowing developers to design, test and deploy
spoken language interfaces rapidly for useful (real world)
applications. The toolkit greatly simplifies the process of specifying
a SDS by use of the Center for Spoken Language Understanding's rapid
prototyper (CSLUrp), a graphically-based SDS authoring environment.
CSLUrp currently provides a small set of style templates which a
developer may use to generate a prompt. A corresponding template is
displayed and slots in the template are filled with current vocabulary
items.
Style    Generated prompt
Polite1  Please choose one of the following options: small, medium or large.
Polite2  Please say: small, medium or large.
Terse    Small, medium or large?
Table 2: CSLUrp generated prompts
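Slot filling of this kind can be sketched as follows. The template strings below are modelled on the prompts in Table 2, but the actual CSLUrp template format is not described here, so the dictionary, the `{options}` slot name, and the function names are assumptions.

```python
from typing import Dict, List

# Hypothetical templates modelled on the prompts in Table 2; the actual
# CSLUrp template format is not described in the text.
TEMPLATES: Dict[str, str] = {
    "Polite1": "Please choose one of the following options: {options}.",
    "Polite2": "Please say: {options}.",
    "Terse": "{options}?",
}

def join_options(vocabulary: List[str]) -> str:
    """Join items as 'a, b or c', matching the punctuation in Table 2."""
    if len(vocabulary) == 1:
        return vocabulary[0]
    return ", ".join(vocabulary[:-1]) + " or " + vocabulary[-1]

def generate_prompt(style: str, vocabulary: List[str]) -> str:
    """Fill the style's template slot with the current vocabulary items."""
    text = TEMPLATES[style].format(options=join_options(vocabulary))
    return text[0].upper() + text[1:]  # capitalize the first letter only

for style in TEMPLATES:
    print(generate_prompt(style, ["small", "medium", "large"]))
```

With the vocabulary small, medium, and large, the three templates reproduce the prompts listed in Table 2.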
CONCLUSION
In this paper we have presented a set of heuristics describing
different styles of transforming a written questionnaire into a form
usable with a SDS. We have identified a set of features that
characterize these styles and a set of dimensions that cover and
contain the feature set. Taken together, the fourteen dimensions we
have presented define a space of system prompts. We refine the notion
of ``style'' as being a set of features characterizing a prompt.
Alternately, we define a prompt style as being a region in a space of
prompts.
ACKNOWLEDGMENTS
This research was funded by the U.S. Bureau of the Census, U S WEST,
the Office of Naval Research, the National Science Foundation, ARPA
and the OGI CSLU.
REFERENCES
1. Colton, D., Cole, R., Novick, D., & Sutton, S. A laboratory course
for designing and testing spoken dialogue systems, Proceedings of
ICASSP-96, Atlanta, GA, May, 1996 (in press).