CHI '95 Proceedings: Posters

Simulation-Based Dialogue Design for Speech-Controlled Telephone Services

Ivan Bretan (1), Anna-Lena Ereback (1), Catriona MacDermid (2), and Annika Waern (1)
(1) Swedish Institute of Computer Science, Box 1263, S-164 28 KISTA, Sweden
(2) Telia Research AB, S-136 80 HANINGE, Sweden
© ACM

Abstract

A design methodology for speech-controlled telephone services has been developed using Wizard-of-Oz simulations as the principal mechanism for evaluating and getting input for dialogue design. This methodology may enable service developers to support dialogues that are optimal with respect to naturalness, especially on a pragmatic level, given the technical restrictions at hand.

Keywords

Speech interfaces, Wizard-of-Oz simulations, telephone services.

METHODOLOGY DEVELOPMENT

Automatic speech understanding is becoming an increasingly attractive option for providing advanced services to a broad audience while making optimal use of the limited bandwidth of the telephone channel. Most existing speech-controlled services are built around small vocabularies and isolated word recognition, but this will change as continuous speech recognition technology matures. Our hypothesis in the DISA (Design for Input Speech Adaptation) project has been that, regardless of the quality of the speech recognition and natural language processing technology available, all such services may benefit from having dialogues derived from task analysis and Wizard-of-Oz simulation studies. In other words, even where natural syntax and semantics cannot be supported, natural pragmatics may still be of use (and to some extent compensate for the limitations).

While our studies have produced both service- and dialogue-specific data, obtaining meta-results has been the real objective, giving us feedback on how to pursue simulation-centered dialogue design.

WIZARD-OF-OZ SIMULATIONS

Several studies have been undertaken in the area of simulating speech-understanding systems [1,3,4], as well as systems interacting through written natural language [2,5], giving suggestions on how to set up such experiments. In the same spirit as most of these experiments, DISA used the simulation set-up shown in Figure 1.

Figure 1: Simulation set-up

The DISA dialogue design methodology comprises at least two separate stages of simulation. The first stage is based on soliciting unrestricted, task-oriented, human-human dialogues. For the second round of simulations, a rudimentary dialogue model is used, describing how the target service should behave in different stages of the dialogue.

RESULTS

Methodological results

By basing the second-stage simulations on a rudimentary dialogue model, the behaviour of an automated service could be approximated. A support tool, the Wizard's Answering Device (WAND), represented this basic organization of the different parts of the dialogues as a set of panels, each having several groups of messages arranged according to subtask, giving the wizard guidance on what answers were appropriate in a given situation.
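The organization described above can be illustrated with a small data structure. This is only a sketch under our own assumptions, not the actual WAND implementation: the class names `Panel` and `MessageGroup`, the method `messages_for`, and the voice-mailbox messages are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MessageGroup:
    """Canned system messages that belong to one subtask (hypothetical name)."""
    subtask: str
    messages: list[str] = field(default_factory=list)


@dataclass
class Panel:
    """One part of the dialogue, holding several message groups (hypothetical name)."""
    name: str
    groups: list[MessageGroup] = field(default_factory=list)

    def messages_for(self, subtask: str) -> list[str]:
        """Return every message the wizard may play for the given subtask."""
        return [m for g in self.groups if g.subtask == subtask for m in g.messages]


# Invented example content for a voice-mailbox service.
listening = Panel(
    name="listen to messages",
    groups=[
        MessageGroup("play", ["You have two new messages.", "First message."]),
        MessageGroup("delete", ["The message has been deleted."]),
    ],
)
```

A wizard looking for an appropriate answer would then browse only the group for the current subtask, for example `listening.messages_for("delete")`.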

WAND defines the communicative space of the simulated system: what the user can do and talk about. Since the type of simulation we have aimed for assumes that the service has limited capabilities of understanding, the dialogue model embodied in WAND handles only requests that can be mapped to corresponding functionality in the service in question. With respect to general speech understanding competence, this stage of the simulation was liberal, and assumed rapid speaker-independent continuous speech recognition with wide grammatical coverage. The rationale was our belief that a dialogue model derived under these circumstances would best support the spectrum from linguistically impoverished one-word command interaction to full-fledged continuous speech understanding. Designing dialogues that are adaptive with respect to the sophistication of the technology available as well as the user's need for system control is an explicit goal of the project.

As far as the actual generation of speech output is concerned, the options for making a wizard seem machine-like have consisted of text-to-speech conversion on the one hand and voice distortion on the other. However, state-of-the-art speech synthesis is generally perceived as less intelligible than human speech, and distorted human speech is by definition less easy to understand than normal speech. Neither is likely to resemble the voice output of real automated telephone services. It turned out that digitized, undistorted spoken messages of the same type as those used in existing automated telephone services (such as voice mail systems) were quite sufficient to give the impression that the subjects were communicating with a machine (19 out of 20 subjects in our most elaborate study believed this). Two factors contributed to this: (1) messages were spoken in the same way as in these services (friendly but formal); and (2) combining canned spoken messages resulted in a relatively strange but consistent prosody.

General observations

A number of observations were made in connection with the principal study so far (a simulation of a speech-controlled voice-mailbox) that we believe generalize to other speech-controlled services. Most importantly, we verified what has been reported elsewhere [6,7], that convergence and colouring phenomena are prevalent. In subsequent interviews, subjects even explicitly expressed the need they had felt to imitate system language in order to compensate for the lack of a clear model of the competence of the dialogue counterpart. The need for a feedback model reflecting the system's competence also proved important, i.e. making explicit what the system is unable to hear (corresponding to a low acoustic score in the speech recognizer), what it does not understand (vocabulary unrelated to the service) and what it cannot do (when the functionality of the service is a limitation).
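The three kinds of feedback distinguished above could be separated along the following lines. This is a minimal sketch under our own assumptions, not a description of the studied service: the threshold, the vocabulary, the set of supported requests and the function name `classify` are hypothetical.

```python
from enum import Enum


class Feedback(Enum):
    OK = "request accepted"
    NOT_HEARD = "the system could not hear the utterance"
    NOT_UNDERSTOOD = "the words are unrelated to the service"
    NOT_SUPPORTED = "the service cannot do what was asked"


# Hypothetical values, for illustration only.
ACOUSTIC_THRESHOLD = 0.5
SERVICE_VOCABULARY = {"play", "delete", "forward", "message"}
SUPPORTED_REQUESTS = {"play", "delete"}


def classify(words, acoustic_score):
    """Pick the feedback category for one recognized utterance."""
    # Low acoustic score: the system was unable to hear the user.
    if acoustic_score < ACOUSTIC_THRESHOLD:
        return Feedback.NOT_HEARD
    # No word belongs to the service vocabulary: not understood.
    known = [w for w in words if w in SERVICE_VOCABULARY]
    if not known:
        return Feedback.NOT_UNDERSTOOD
    # Service vocabulary was used, but no supported request was named.
    if not any(w in SUPPORTED_REQUESTS for w in known):
        return Feedback.NOT_SUPPORTED
    return Feedback.OK
```

Keeping the three failure categories distinct lets the system messages tell the user whether to speak up, rephrase, or abandon the request.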

Service-specific data

For every service subject to this type of simulation, a lot of data can be gathered in the form of transcribed dialogues. Following careful analysis, it is possible to revise and refine the dialogue model to reflect in more detail an organization of the dialogue concerning tasks and subtasks that makes sense to users. An important part of this refined dialogue model is information about the vocabulary that users will want to use in order to carry out different tasks. Already during free dialogue collections, some data may have been obtained (task organization and corresponding vocabulary), but at this later stage the data will be much more structured, through the use of the dialogue model.

FUTURE WORK

The third stage of the simulations, on which we are currently working, aims at integrating more detailed dialogue models into the simulations, using data from the second stage. In this scenario, the models will contain information not only about which messages are associated with which subtasks, but also about which types of user request trigger which system messages, what the flow of dialogue looks like, the dynamics of system control and initiative, and what meta dialogues (such as help, requests for clarification, etc.) can be initiated. It will be possible to load this model into WAND, in effect replacing parts of the cognitive processing of the wizard. The purpose of this is to approach the restrictions of actually implementable speech understanding as closely as possible. In addition to using these dialogue models in "generation mode" during the simulations, it will in fact also be possible to use them as part of the language analysis machinery in the developed service.
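One way to picture such a detailed dialogue model is as a table mapping a dialogue state and a type of user request to a system message and a next state, with meta dialogues available everywhere. The sketch below is an assumption made for illustration; the states, request types, messages and the function name `respond` are hypothetical and do not come from the project.

```python
# (state, request type) -> (system message, next state); all entries hypothetical.
TRANSITIONS = {
    ("main", "listen"): ("You have one new message.", "listening"),
    ("listening", "delete"): ("The message has been deleted.", "main"),
    ("listening", "repeat"): ("Repeating the message.", "listening"),
}

# Meta dialogues can be initiated from any state and leave the state unchanged.
META = {
    "help": "You can say listen, delete or repeat.",
    "clarify": "Please say that again.",
}


def respond(state, request):
    """Return (system message, next state) for one classified user request."""
    if request in META:
        return META[request], state
    if (state, request) in TRANSITIONS:
        return TRANSITIONS[(state, request)]
    # The request is not available in this state; stay where we are.
    return "I cannot do that here.", state
```

A table of this kind can be read in both directions: to select the message to play during a simulation, and to restrict which request types need to be recognized in a given state.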

ACKNOWLEDGEMENTS

We thank Björn Bergström of Telia Research, who developed the first version of WAND.

References

1 Amalberti, R., Carbonell, N., and Falzon, P. (1993) "User Representations of Computer Systems in Human-Computer Speech Interaction", Int. J. Man-Machine Studies, vol. 38, 547-566.
2 Dahlbäck, N., Jönsson, A., and Ahrenberg, L. (1993) "Wizard of Oz Studies - Why and How", Proceedings of the Workshop on Intelligent User Interfaces, Orlando, Florida.
3 Dybkjaer, L. and Dybkjaer, H. (1993) "Wizard-of-Oz Experiments in the Development of the Dialogue Model for P1", Report 3a, Spoken Language Dialogue Systems, STC Aalborg University, CCI Roskilde University, CST University of Copenhagen, Denmark.
4 Fraser, N. and Gilbert, N. (1991) "Simulating Speech Systems", Computer Speech and Language, vol. 5, 81-89.
5 Hauptmann, A. and Rudnicky, A. (1988) "Talking to Computers: An Empirical Investigation", Int. J. Man-Machine Studies, vol. 28, 583-604.
6 Karlgren, J. (1992) "The Interaction of Discourse Modality and User Expectations in Human-Computer Dialog", Licentiate Thesis at the Dept. of Computer and Systems Sciences, University of Stockholm, Sweden.
7 Leiser, R. G. (1989) "Exploiting Convergence to Improve Natural Language Understanding", Interacting with Computers, vol. 1, 284-298.