



Figure 1. Picture of CyberBELT interface.
Figure 2. Picture of user and the CyberBELT interface.
If she dislikes a clip, she can interrupt it to
return to the previous collage of clips by saying "back", to go to
the next collage by saying "skip", or to go to the next chapter
by saying "next". If the viewer remains passive and does not
select a thread, CyberBELT moves on to the next chapter.
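The spoken commands and the passive default can be summarized in a small dispatch function. This is only a minimal sketch, not the CyberBELT implementation: the function name, the action labels, and the rule that a pointed-at icon takes precedence over a gazed-at icon are assumptions made for illustration. It also folds the selection phase and the playback phase into a single function for brevity.

```python
from typing import Optional, Tuple

# Spoken words named in the text, mapped to hypothetical action labels.
INTERRUPTS = {
    "back": "previous_collage",
    "skip": "next_collage",
    "next": "next_chapter",
}


def interpret(
    utterance: Optional[str],
    pointed: Optional[str] = None,
    gazed: Optional[str] = None,
) -> Tuple[str, Optional[str]]:
    """Return an (action, target) pair for one viewer input.

    `utterance` is None when the viewer stays passive and the selection
    period times out. `pointed` and `gazed` are the icons (if any) that
    the glove and the eye-tracker currently report.
    """
    if utterance is None:
        return ("next_chapter", None)  # passive viewer: move on
    word = utterance.strip().lower()
    if word == "go there":
        # Assumption: pointing wins over gaze when both are available.
        target = pointed if pointed is not None else gazed
        if target is None:
            return ("ignore", None)  # nothing to resolve "there" against
        return ("play_thread", target)
    if word in INTERRUPTS:
        return (INTERRUPTS[word], None)
    return ("ignore", None)  # unrecognized speech
```

For example, interpret("go there", gazed="keller") selects the thread under the viewer's gaze, while interpret(None) advances to the next chapter.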
CyberBELT uses the eye gaze of the viewer at two phases of the
interaction. When the viewer is looking at a collage of icons to select
the next thread, the system reveals explanatory text under any icon
she is looking at. It explains how the thread will relate to the
previous clip and reveals a brief quote. The system reveals text
until the viewer selects a thread or until the selection period
times out and the story moves to the next chapter. CyberBELT
also uses eye tracking information to gauge the viewer's level
of attention while watching a clip. If the viewer's gaze wanders
outside the playing clip, she is not watching it attentively, and
we assume she is not interested.
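Both uses of eye gaze reduce to testing gaze points against rectangles on the display. The following is a minimal sketch under stated assumptions: the Rect type, the function names, and the idea of measuring attention as the fraction of gaze samples inside the clip are ours, chosen for illustration. The text does not specify how the system measures attention.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Rect:
    """Axis-aligned screen region (hypothetical display coordinates)."""
    x: float
    y: float
    width: float
    height: float

    def contains(self, point: Point) -> bool:
        px, py = point
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def icon_under_gaze(icons: Dict[str, Rect], gaze: Point) -> Optional[str]:
    """Name of the icon whose explanatory text should be revealed, if any."""
    for name, rect in icons.items():
        if rect.contains(gaze):
            return name
    return None


def attention(clip_area: Rect, gaze_samples: List[Point]) -> float:
    """Fraction of gaze samples that fall inside the playing clip."""
    if not gaze_samples:
        return 0.0
    inside = sum(1 for g in gaze_samples if clip_area.contains(g))
    return inside / len(gaze_samples)
```

A system built this way would compare the attention value against some threshold to decide whether the viewer has lost interest; the threshold itself would have to be tuned and is not given in the text.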
While speaking and pointing are active feedback, eye-gaze tracking
provides passive feedback. Eye-tracking is a powerful and
non-invasive way to monitor where the user is focusing her attention.
Figure 3. Picture of user and the CyberBELT interface.
As full content representation of video is a complex task [7], we
annotated the clips with the speaker, the topics of the conversation
and the point of view of the speaker with regard to the topic.
The annotated clips create a semantic web where the proximity of
two clips is proportional to the number of annotations in common.
The annotations permit CyberBELT to present the viewer with a
collage of clips that exhibit contrast with the previous clip.
Two clips that have annotations in common but mismatch on one
annotation make for an interesting contrast. For example, after
seeing a clip where Evelyn Fox Keller regrets the lack of
influence of Cybernetics on molecular biology, the viewer
selects among clips on the influence of Cybernetics on other
disciplines, from opposing points of view, or by different speakers.
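The proximity measure and the contrast rule can be sketched as follows. This is an illustration rather than the system's actual data model: it assumes each clip carries exactly one speaker, one topic, and one point of view (the text allows several topics per clip), and the clip names and annotation values in the usage example are invented placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Clip:
    name: str
    speaker: str
    topic: str
    viewpoint: str

    def annotations(self) -> Dict[str, str]:
        return {
            "speaker": self.speaker,
            "topic": self.topic,
            "viewpoint": self.viewpoint,
        }


def proximity(a: Clip, b: Clip) -> int:
    """Number of annotations the two clips have in common."""
    fa, fb = a.annotations(), b.annotations()
    return sum(1 for key in fa if fa[key] == fb[key])


def contrasting(previous: Clip, clips: List[Clip]) -> List[Clip]:
    """Clips that match `previous` on every annotation except exactly one."""
    wanted = len(previous.annotations()) - 1
    return [c for c in clips
            if c != previous and proximity(previous, c) == wanted]
```

With a previous clip annotated (speaker A, molecular biology, little influence), the candidates are clips that change only the topic, only the point of view, or only the speaker, which mirrors the three kinds of follow-up clips in the example above.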
As mentioned earlier, the documentary evolves with the feedback from
the viewer. Initially, all clips are equally weighted, that is,
equally likely to be proposed to the viewer. As the viewer
selects or interrupts clips and as her gaze is monitored, the
system alters the weights of the clips. When the weight of a clip
changes, the weights of all other clips change to the degree that
they have annotations in common. Thus, the weights model the
viewer's preferred concepts, not preferred clips. For example,
if the viewer frequently selects clips with Jay Forrester,
the likelihood of all Forrester clips in the data base increases.
At subsequent decision points in the documentary CyberBELT
is more likely to propose threads with Jay Forrester.
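The weight propagation can be sketched as below. The text states only that other clips change "to the degree that they have annotations in common"; the specific rule used here (add a fixed amount per shared annotation, clamp at zero, and normalize weights into proposal likelihoods) is an assumption for illustration, as are the function names and the sample data.

```python
from typing import Dict, Set


def update_weights(
    weights: Dict[str, float],
    annotations: Dict[str, Set[str]],
    clip: str,
    delta: float,
) -> Dict[str, float]:
    """Return new weights after feedback on `clip`.

    Every clip moves by `delta` times the number of annotations it shares
    with `clip`, so `clip` itself moves the most. `delta` is positive for
    a selection and negative for an interruption or a wandering gaze.
    """
    target = annotations[clip]
    return {
        name: max(0.0, weight + delta * len(annotations[name] & target))
        for name, weight in weights.items()
    }


def proposal_likelihoods(weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize weights so that they sum to one."""
    total = sum(weights.values())
    if total == 0:
        return {name: 1.0 / len(weights) for name in weights}
    return {name: weight / total for name, weight in weights.items()}
```

In this sketch, selecting one Forrester clip raises a second Forrester clip that shares only the speaker annotation, while a clip with no shared annotations keeps its weight and therefore loses relative likelihood.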
The clip weights, or the model of the viewer, can be saved at
the end of a session and reloaded at the beginning of another.
By loading another viewer's weights, one can watch a documentary
on Cybernetics that reflects the other viewer's preferences.
The system could also model the preferences of a group of viewers.
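Saving, reloading, and combining viewer models can be sketched as follows. The JSON file format, the function names, and the choice of a clip-by-clip average for the group model are assumptions for illustration; the text does not say how weights are stored or how a group would be modeled.

```python
import json
from typing import Dict, List

Weights = Dict[str, float]


def save_model(weights: Weights, path: str) -> None:
    """Write one viewer's clip weights to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(weights, handle, indent=2, sort_keys=True)


def load_model(path: str) -> Weights:
    """Read clip weights saved by save_model."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def group_model(models: List[Weights], default: float = 1.0) -> Weights:
    """Average several viewers' weights clip by clip.

    A clip missing from one viewer's model counts with `default`,
    standing in for the initial equal weight.
    """
    clips = set().union(*models)
    return {
        clip: sum(m.get(clip, default) for m in models) / len(models)
        for clip in sorted(clips)
    }
```

Loading another viewer's file with load_model and using it as the starting weights would reproduce that viewer's preferences in a new session.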
Figure 4. Picture of user and the CyberBELT interface.
On the other hand, an interactive documentary is a well-suited
application for multi-modal interaction. A multi-threaded
documentary presents a complex data space to the viewer.
The viewer traverses the space by exploration. She considers
different options without knowing ahead of time exactly where
she wants to go. Since speed is not important in an exploratory
mode, an immersive slow-moving experience through a multi-modal
interface is ideal.
In the future we plan to have people watch the multi-threaded
documentary on Cybernetics and to record their reactions.
Abstract
CyberBELT allows a viewer to interact with a multi-threaded documentary
using a multi-modal interface. The viewer interacts with the documentary
by speaking, pointing and looking around the display. The viewer selects
the threads of the story to follow or lets the system navigate
through the story. Feedback from the viewer evolves the story to present
concepts she is interested in. We discuss the suitability of combining
multi-modal interaction and multi-threaded narrative.
Keywords
multi-modal interaction, interactive documentary, information exploration,
dynamic story-telling system
Scenario
'I walk into the room. On a wall-size display, Seymour Papert says
with shining eyes: 'Cybernetics helps us learn about life.' Then
Evelyn Fox Keller, Jay Forrester, Oliver Selfridge and Slavan Gerovitch
appear here and there on the display. As I let my gaze wander,
the characters unveil text seducing my gaze. I look at Evelyn,
smiling wisely, and the text appears: 'Evelyn Fox Keller gives
her point of view: Cybernetics embraces the complexity found
in nature.' As if knowing I had looked at her, she begins to talk
to me. I am immersed in a sea of conversations, words, and images...'
Introduction
The scenario above is a futuristic view of the interaction with the
CyberBELT system and the documentary on the history of Cybernetics.
Needless to say, the interaction is more cumbersome with current
technology than depicted in the scenario. The viewer needs to wear
an uncomfortable eye-tracker, a microphone and a data glove.
Despite the discomfort of the armor, the multi-modal interaction
allows the viewer to use her whole body to control the flow of
the documentary and to convey her preferences and interests to
the system. This work in progress brings together, to our knowledge
for the first time, multi-modal interaction and a multi-threaded
documentary. We show that multi-modal interaction is well suited
to exploring the threads of a narrative.
Multi-Modal Interaction
Previous Work
For a 'full-body' communication with the CyberBELT system, the viewer
uses three technologies: a speech recognizer, an eye-gaze tracker,
and data gloves. Previous work in multi-modal interfaces includes the
early 'Put That There' project [1], where the viewer used speech and
gesture to manipulate objects on a display. The system resolved
ambiguous references to objects by combining the information from
speech and gesture. Work by Starker [2] used eye-tracking information
to reveal detail about objects or groups of objects gazed at by
the viewer. All three technologies have been integrated in
applications such as [3].
Multi-Modality in CyberBELT
In CyberBELT the viewer uses the three modalities to the extent that
she wishes. She takes on an active or a passive role: she can select
threads of the story and explore the different themes at her own pace
or she can let the system navigate. When presented with a collage
of video icons representing the possible threads to follow,
she selects one by saying "go there" and pointing or looking at the
appropriate icon.

Multi-Threaded Narrative
Previous Work
Working with digital video permits flexible manipulation of clips
and personalization of a narrative. Non-linear video stories
complement classical linear film as they allow the viewer to
explore multiple intertwined threads and characters.
Aspen [4], an early interactive project, was a surrogate travel
experience through the city of Aspen. The viewer navigated through photos
and video clips of the city by touching the screen or moving a
joystick. The transition between two clips was allowed only
if the corresponding sites were adjacent in the real city.
In Portraits of People Living with AIDS [5]
the viewer explores the different themes in the data base of
interview clips. While the transitions between clips are
predefined and static, the documentary is dynamic because
viewers can record comments addressed to the characters.
In other interactive documentaries the sequencing of clips
is dynamic. For example, Train of Thought [6]
uses filters to select scenes from a video data base and
fills a story template.
Multi-Threaded Narrative in CyberBELT
Like Train of Thought, CyberBELT dynamically selects
clips to fill a pre-defined narrative structure. However, in CyberBELT
the viewer's choices affect the ongoing story. We divided the
story into chapters to ensure a coherent progression from one
theme to another and to give a broad view on the topic even
to the viewer who follows the shortest thread and watches only
one clip from each chapter. While the overall structure
is pre-defined, the sequencing of clips within a chapter is dynamic.

Conclusion
Multi-modal interaction is a suitable way to interact with a
multi-threaded documentary. It facilitates controlling the
documentary and modeling the viewer's interests. With speech
and gesture, the viewer can express her wishes to the system
in a natural way. Whereas a mouse interface could replace
speaking and pointing, no other technology could replace eye-tracking
in providing passive feedback about the viewer's focus of attention.
Eye-tracking enables CyberBELT to know what to show next to the
viewer and which items on the display to expand with an explanation.
Eye-tracking is particularly useful in a multi-threaded documentary
where the viewer's options are spatially distributed on the display.
Acknowledgements
CyberBELT was developed under the direction of Professors
Ken Haase and Glorianna Davenport as part of "Seminar in Storyteller
Systems", taught during the Spring Semester of 1994 at the MIT Media
Laboratory. The eye-tracking and gesture technology used in CyberBELT
was developed by the Advanced Human Interface Group under the direction
of Dr. Richard Bolt at the MIT Media Laboratory. The authors would like
to thank the people we interviewed: Evelyn Fox Keller, Jay Forrester,
Oliver Selfridge, Seymour Papert and Slavan Gerovitch, who were very
generous with their time and are a source of inspiration.