Nicole Yankelovich and Cynthia D. McLain*
The Office Monitor is a proof-of-concept project designed to enable members of a work group or other office visitors to leave quick messages or indicate that they stopped by. It also makes it easy for office occupants to let others know their schedule and perhaps their whereabouts. In general, the Office Monitor is intended to augment the informal, everyday communication that goes on in a typical office.
In its current incarnation, the Office Monitor consists of a lifelike mannequin, a microphone on a stand, speakers, and a motion detector. To use the Office Monitor, the office occupant turns it on when leaving the office. The occupant may either record an out-going message (e.g., "I'm at lunch and will be back by 1:30 at the latest") or leave without recording anything.
The Office Monitor greets the visitor when the motion detector is triggered. If the visitor is looking for the office occupant, the Office Monitor relates any relevant public calendar information, plays the out-going message (if there is one), and invites the visitor to leave a message.
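The ordering described above (calendar information first, then the out-going message if one exists, then the invitation to leave a message) can be sketched in a few lines. This is only an illustrative sketch, not the SpeechActs implementation: the names `OutgoingMessage` and `build_response`, and the exact prompt wording, are assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OutgoingMessage:
    """Hypothetical stand-in for the occupant's recorded out-going message."""
    recorded_at: str   # time the occupant recorded it, e.g. "12:30"
    transcript: str    # text placeholder for the recorded audio


def build_response(occupant: str,
                   calendar_note: Optional[str],
                   outgoing: Optional[OutgoingMessage]) -> List[str]:
    """Return the prompts to speak, in order, once the visitor has said
    they are looking for the occupant."""
    prompts: List[str] = []
    # 1. Relay any relevant public calendar information.
    if calendar_note:
        prompts.append(f"{occupant} {calendar_note}.")
    # 2. Play the out-going message, if the occupant recorded one.
    if outgoing:
        prompts.append(
            f"This message was left at {outgoing.recorded_at}: "
            f'"{outgoing.transcript}"'
        )
    # 3. Always finish by inviting the visitor to leave a message.
    prompts.append("Would you like to leave a message?")
    return prompts
```

Both the calendar note and the out-going message are optional in this sketch, which mirrors the occupant's choice to leave without recording anything.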
The Office Monitor speech interface is built using the SpeechActs framework [2]. This framework supports multiple speech recognizers, multiple speech synthesizers, a natural language processor, and a discourse manager. To date, the applications created with SpeechActs have all been telephone-based. The Office Monitor represents an attempt to use speech technology to support conversational interaction in a physical setting.
Although this study was short and informal, we made several observations. First, even when conversing with the researcher, visitors did not go further than the threshold unless they were dropping something off. Second, the conversations were quite short, ranging from 10 to 20 seconds. Here is an example of one of the shorter interactions:
Researcher: Hi. R-'s in AW's office. May I take a message?
Visitor: Ummm, I think I'll go find her there. Thanks.
Almost all of the conversations followed a characteristic pattern. They opened with a brief, informal greeting. Following this was a question/answer segment, and then some sort of thank-you or non-verbal closing, such as a nod.
O.M: Hi. Are you looking for Cynthia?
User: Yes, I am.
O.M: Cynthia has an appointment until 2pm. She left
this message at 12:30:
"Hi. I'll be in the Faber conference room most
of the afternoon."
Would you like to leave her a message?
User: Cynthia, stop by my office when you can.
O.M: I didn't understand what you said, but I recorded
it anyway. Say save it, forget it, or rerecord it.
User: Save it.
O.M: Sorry. I didn't catch your name.
User: This is Nicole.
O.M: Thanks. I'll give Cynthia your message.
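The confirmation step at the end of this exchange can be read as a small command handler: the visitor's message is recorded whether or not it was understood, and the system then waits for one of three commands. The sketch below is one possible reading of that step, not the actual discourse manager. The function name `next_prompt` is an assumption, and the replies for "forget it" and "rerecord it" are hypothetical because the sample dialogue does not show them.

```python
from typing import Optional


def next_prompt(command: str, visitor_name: Optional[str], occupant: str) -> str:
    """Choose the system's next utterance after a message has been recorded."""
    # Normalize the recognized text, e.g. "Save it." -> "save it".
    normalized = command.strip().lower().rstrip(".")
    if normalized == "save it":
        if visitor_name is None:
            # The message cannot be attributed yet, so ask for a name.
            return "Sorry. I didn't catch your name."
        return f"Thanks. I'll give {occupant} your message."
    if normalized == "forget it":
        return "OK. I won't save your message."       # hypothetical wording
    if normalized == "rerecord it":
        return "Please record your message again."    # hypothetical wording
    # Anything else: repeat the three available commands.
    return "Say save it, forget it, or rerecord it."
```

With `visitor_name` unset, `next_prompt("Save it.", None, "Cynthia")` yields the request for a name seen in the dialogue; once the name is known, the same command yields the closing acknowledgment.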
Another design challenge, due to technological limitations of the speech recognizer, was getting users to speak directly into the microphone. We addressed this problem with a combination of timing and explicit prompting. The Office Monitor speaks a greeting when motion is detected. If there is no response within several seconds, the Office Monitor guesses that the visitor spoke but was not heard, and says "I didn't hear what you said. Please speak into the microphone."
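The timing rule can be sketched as a single check that runs after the greeting. This is a minimal illustration under stated assumptions: the text says only "several seconds," so the 5-second default below is a placeholder, and the names `prompt_after_greeting` and `REPROMPT` are introduced here for the example.

```python
from typing import Optional

REPROMPT = "I didn't hear what you said. Please speak into the microphone."


def prompt_after_greeting(seconds_since_greeting: float,
                          heard_speech: bool,
                          timeout_s: float = 5.0) -> Optional[str]:
    """Return the reprompt if the recognizer has heard nothing for
    `timeout_s` seconds after the greeting; otherwise return None.

    `timeout_s` is an assumed value standing in for "several seconds".
    """
    if heard_speech:
        return None          # the recognizer picked something up
    if seconds_since_greeting >= timeout_s:
        return REPROMPT      # assume the visitor spoke away from the microphone
    return None              # still within the waiting window
```

The design choice worth noting is that silence is treated as a probable microphone problem rather than as an absent visitor, since the motion detector has already indicated that someone is there.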
The challenge that we had the most difficulty addressing was getting people to cross the threshold of an empty office. First, particularly at Sun, where office doorways have windows, there is little reason for a passer-by to stop in an obviously unoccupied office. Second, some perceive it as an invasion of privacy to enter an office uninvited. We temporarily addressed this problem in the prototype by placing a basket of candy in the office as a bribe to enter. The mannequin also served to lure some people close enough to the threshold to trigger the motion detector.
Each participant received an e-mail survey after their encounter. This included one person who showed up at the pre-arranged time, but did not cross the threshold. In the survey, 11% characterized their experience as "overall negative," 28% as "mixed," and 61% as "overall positive." These positive responses were given despite delays in speech recognition, which created unnatural pauses in the conversation. Aside from detracting from naturalness, the slow pacing tended to create ambiguity about when the conversation was over. In analyzing the videotaped interactions, we noticed that many participants started to walk away, only to be pulled back each time the computer spoke. Correspondingly, in the follow-up survey, over 75% of the participants said they felt compelled to complete the interaction. With regard to the mannequin, some users thought it made the voice less startling and provided a conversational focus, but others thought it was "weird," "irrelevant," and too large. Although no participants explicitly mentioned it as a problem, the interactions were considerably longer in the user study (20 seconds to 2 minutes) than in the pre-design study (10 to 20 seconds).
As members of a speech group, we began our design as a speech-only endeavor. After conducting two user studies, however, we have concluded that a multi-modal approach, combining speech with graphics and perhaps video, might more effectively address users' needs.
2. Yankelovich, Nicole and Eric Baatz. "SpeechActs: A Framework for Building Speech Applications," AVIOS '94 Conference Proceedings, San Jose, CA, September 20-23, 1994.