Michael J. Witbrock
Michael G. Christel
For the huge amounts of audio and video material that could usefully be included in digital libraries, the cost of producing human-generated annotations and meta-data is prohibitive. In the Informedia Digital Video Library, the production of meta-data supporting the library interface is automated using techniques from Artificial Intelligence (AI). By applying speech recognition, natural language processing and image analysis, the interface helps users locate the information they want and navigate or browse the digital video library more effectively. Specific AI-based interface components include automatic titles, filmstrips, video skims, word location marking and representative frames for shots.
video browsing, information retrieval interfaces, speech recognition, News-On-Demand, multimedia indexing and search, Informedia, artificial intelligence, automatic text summarization, video summarization, digital library.
© 1997 Copyright on this material is held by the authors.
The Informedia Digital Video Library Project [3] at Carnegie Mellon University is creating a digital library of text, image, video and audio data whose entire content can be searched for rapid retrieval of relevant material. Through the integration of technologies from the fields of natural language understanding, image processing, speech recognition and video compression, the Informedia System [3] allows a user to explore multimedia data both in depth and in breadth. An overview of the structure of the Informedia system is shown in Figure 1.
News-on-Demand [2] is a particular collection in the Informedia Digital Library that has served as a test-bed for automatic library creation techniques. While the main Informedia library is constructed with sufficient human involvement to ensure high quality, in News-on-Demand complete automation is the principal goal. Motivated by the timeliness required by news data, and the volume of material to be indexed, we have applied speech recognition to the creation of a fully content-indexed library and to interactive querying. While this work is centered around processing news stories from TV broadcasts, the system exemplifies an approach that can make any video, audio or text data more accessible.

Figure 1. Overview of the Informedia Digital Video Library System
The AI techniques enable the interface to support rapid and accurate search of imperfect news data; imagine a user who says to the system "Tell me about Chinese dissident Harry Wu." The system searches and retrieves the best twenty-four news stories that match the query. Moving the mouse over the representative poster frames extracted from the stories [1] causes a text summary headline "Trial appeared television tonight, Wu, head bowed, weak" to appear. Another story poster has the headline "Formally protested eruption Sino relations detention right". Although imperfect, these summaries allow the user to select the story of greater interest, in this case the first one, which is clearly about the rationale the Chinese courts used to convict Wu. Clicking on the poster frame starts the video of the story playing. Underneath the video window is a bar with colored lines showing the exact time at which every query term was spoken. Clicking on the word 'Wu' in the query highlights the bars representing that word, and the user can click a button to skip past introductory material to the exact place where Mr. Wu's name was spoken.
If the video clip had been longer, the user could have switched to a "filmstrip" view of the story, where every shot is represented by one frame. Again, occurrences of the query words are marked on the filmstrip exactly where they occur. Alternatively, the user might elect to play a video "skim" of the story (as proposed in [3]), in which only the most important sections of the story are presented, taking a fraction of the original time.
AI techniques from image understanding, information retrieval, speech recognition and natural language processing are incorporated into the interface:
IMAGE PROCESSING FOR SHOT BREAKS AND REPRESENTATIVE FRAMES. Color histogram and Lucas-Kanade optical flow analysis are applied to the MPEG-encoded video. This enables the software to identify editing effects such as cuts and pans that mark shot changes. A single representative frame from each shot is chosen for use in poster frames or in the filmstrip view. Longer segments, such as interviews, are represented by a series of frame images, to indicate the passage of time.
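The histogram part of this step can be illustrated with a minimal sketch. It is not the Informedia implementation: it works on already-decoded frames given as lists of RGB pixels, omits the Lucas-Kanade optical flow analysis entirely, and the function names, the 0.5 threshold and the choice of the middle frame as the representative are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]   # (r, g, b), each 0..255
Frame = Sequence[Pixel]        # a decoded frame as a flat pixel list (assumption)


def color_histogram(frame: Frame, bins_per_channel: int = 4) -> List[float]:
    """Normalized joint RGB histogram with bins_per_channel**3 buckets."""
    n = bins_per_channel
    hist = [0.0] * (n ** 3)
    for r, g, b in frame:
        index = ((r * n // 256) * n + (g * n // 256)) * n + (b * n // 256)
        hist[index] += 1.0
    total = float(len(frame)) or 1.0
    return [count / total for count in hist]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Half the L1 distance: 0.0 for identical histograms, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def find_cuts(frames: Sequence[Frame], threshold: float = 0.5) -> List[int]:
    """Indices i where a cut is assumed between frame i-1 and frame i."""
    hists = [color_histogram(f) for f in frames]
    return [
        i for i in range(1, len(hists))
        if histogram_distance(hists[i - 1], hists[i]) > threshold
    ]


def representative_frames(frames: Sequence[Frame], cuts: Sequence[int]) -> List[int]:
    """Index of the middle frame of each shot (a hypothetical selection rule)."""
    bounds = [0] + list(cuts) + [len(frames)]
    return [(start + end - 1) // 2
            for start, end in zip(bounds, bounds[1:]) if end > start]
```

A hard cut changes the color distribution abruptly, so a large histogram distance between neighboring frames is a reasonable cue for it. Gradual effects such as pans change the histogram slowly, which is one reason a motion analysis is needed in addition to this test.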
SPEECH RECOGNITION FOR TRANSCRIPT CREATION AND ALIGNMENT. The CMU Sphinx-II continuous, speaker-independent speech recognition system is run against the audio track extracted from the MPEG of the show, or captured from [...]. This process correctly recognizes between twenty and seventy percent of the words, depending on the content, with a whole-show average of fifty percent. The speech recognizer identifies the time at which words are spoken with 10 ms precision. If another transcript, such as broadcast closed captioning, is available for the story, it is automatically aligned with the speech recognizer output using Dynamic Time Warping (DTW), so that the timings from the recognizer can be applied to the more accurate closed-captioned transcript. These timings are stored in the index and enable the word-precise navigation supported by the interface. If no independent transcript is available, the speech recognition results are used directly for indexing. Information retrieval accuracy is still high in this case, showing a seventy to eighty percent overlap with perfect transcripts in identifying the top-ranked story.
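The transfer of recognizer timings onto a caption transcript can be sketched as a word-level DTW. This is an illustrative reconstruction rather than the project's code: the 0/1 word-match cost, the tie-breaking during backtracking, and the names dtw_align, transfer_timings and first_occurrence_ms are assumptions.

```python
from typing import List, Optional, Tuple


def dtw_align(caption: List[str], recognized: List[str]) -> List[Tuple[int, int]]:
    """Return the DTW warping path as (caption_index, recognized_index) pairs."""
    n, m = len(caption), len(recognized)
    if n == 0 or m == 0:
        return []
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost: 0 for a matching word, 1 otherwise (an assumption).
            same = caption[i - 1].lower() == recognized[j - 1].lower()
            local = 0.0 if same else 1.0
            cost[i][j] = local + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end of both sequences along the cheapest predecessors.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


def transfer_timings(
    caption: List[str], recognized: List[str], times_ms: List[int]
) -> List[Tuple[str, Optional[int]]]:
    """Give each caption word the start time of a recognizer word aligned to it."""
    timings: List[Optional[int]] = [None] * len(caption)
    for ci, ri in dtw_align(caption, recognized):
        exact = caption[ci].lower() == recognized[ri].lower()
        # Prefer the time of an exactly matching word over a mere neighbor.
        if timings[ci] is None or exact:
            timings[ci] = times_ms[ri]
    return list(zip(caption, timings))


def first_occurrence_ms(timed_words, query: str) -> Optional[int]:
    """Time of the first occurrence of a query word, for skipping ahead in playback."""
    for word, time_ms in timed_words:
        if word.lower() == query.lower():
            return time_ms
    return None
```

Caption words that the recognizer got wrong still receive a time from a neighboring recognized word on the warping path, so every word of the more accurate transcript ends up with an approximate position in the audio.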
NATURAL LANGUAGE PROCESSING FOR SEMANTIC WEIGHTING. We generate both the text headlines and the video "skims" by extractive summarization. Stories are scanned for words that have a high inverse document frequency and that strongly distinguish stories, as determined by a chi-square measure. In the case of the text headlines, a fixed-length set of highly weighted terms from early in the story is used. In the case of skims, first a high-scoring word is chosen, then surrounding words are added, such that the score is maximized, until a segment long enough for playback is marked. Further segments are chosen until the desired skim duration is reached.
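The weighting and selection steps can be sketched as follows. The sketch makes several simplifying assumptions that are not taken from the system itself: it uses term frequency times inverse document frequency only and leaves out the chi-square measure, it measures segment length in words rather than in playback time, it selects a single skim segment, and the function names and default parameters are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def idf_weights(stories: List[List[str]]) -> Dict[str, float]:
    """Inverse document frequency of each word over a set of tokenized stories."""
    doc_freq = Counter(word for story in stories for word in set(story))
    return {w: math.log(len(stories) / df) for w, df in doc_freq.items()}


def headline(story: List[str], idf: Dict[str, float],
             lead_words: int = 30, length: int = 5) -> List[str]:
    """Highest-weighted distinct terms from the start of the story, in spoken order."""
    lead = story[:lead_words]
    tf = Counter(lead)
    # Rank by TF-IDF weight; break ties in favor of the earlier word.
    ranked = sorted(tf, key=lambda w: (-tf[w] * idf.get(w, 0.0), lead.index(w)))
    chosen = set(ranked[:length])
    return [w for w in dict.fromkeys(lead) if w in chosen]


def skim_segment(story: List[str], idf: Dict[str, float],
                 min_words: int = 4) -> Tuple[int, int]:
    """Grow a window around the highest-weighted word; returns [start, end)."""
    if not story:
        return (0, 0)
    scores = [idf.get(w, 0.0) for w in story]
    start = max(range(len(story)), key=scores.__getitem__)
    end = start + 1
    while end - start < min(min_words, len(story)):
        # Extend toward whichever neighbor adds more weight; -1.0 marks a story edge.
        left = scores[start - 1] if start > 0 else -1.0
        right = scores[end] if end < len(story) else -1.0
        if left > right:
            start -= 1
        else:
            end += 1
    return (start, end)
```

Because the headline is a bag of highly weighted terms rather than a sentence, its output reads like the telegraphic examples quoted earlier, such as "Trial appeared television tonight, Wu, head bowed, weak".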
Judicious combination of a variety of AI techniques has enabled us to construct an effective interface to a digital video library. Speech recognition is used for transcription and alignment, image processing is used for shot analysis and to identify representative frames, and natural language processing is used for summarization. Despite the imperfections in each of these techniques, and the problems inherent in processing unmodified broadcast news data, strong navigation tools supported by the use of AI allow the user to quickly retrieve appropriate stories from the Informedia Digital Video Library.
The authors are grateful to Mosur Ravishankar and our colleagues in the CMU speech group for their help, to Michael Smith for the image processing code, to Ricky Houghton for programming support, and to Howard Wactlar for fearless leadership. This work is supported in part by the National Science Foundation, ARPA, and NASA under NSF Cooperative Agreement No. IRI-9411299.
1. Hauptmann, A.G. and Smith, M.A. Text, Speech and Vision for Video Segmentation: the Informedia Project. AAAI Fall Symposium on Computational Models for Integrating Language and Vision, Boston MA, Nov 10-12 1995.
2. Hauptmann, A.G. and Witbrock, M.J. Informedia News on Demand: Multimedia Information Acquisition and Retrieval, in Maybury, M.T., Ed., Intelligent Multimedia Information Retrieval, AAAI Press/MIT Press, Menlo Park, 1996 (In Press).
3. Wactlar, H.D., Kanade, T., Smith, M.A. and Stevens, S.M. Intelligent Access to Digital Video: Informedia Project. IEEE Computer, 29(5), May 1996, 46-52. See also http://www.informedia.cs.cmu.edu/.