Garett Dworman
Operations and Information Management Department
The Wharton School, University of Pennsylvania
Philadelphia, PA 02147
(215)898-7211 dworman@opim.wharton.upenn.edu http://opim.wharton.upenn.edu/~dworman
Traditional research in information retrieval concentrates on
retrieving documents. This paper introduces the idea that valuable
information exists within a document collection as thematic patterns
that can be found without looking at individual documents in the
collection. This information is valuable in its own right and
as an aid to the IR process, and is often not contained in any
of the collection's documents. This paper introduces a pattern
discovery support system, Homer, which aids users' search for
patterns, and presents some compelling anecdotal evidence.
information retrieval, pattern discovery.
Traditional research in information retrieval (IR) concentrates on identifying and retrieving relevant documents. However, when many documents are brought together into a collection, global information emerges from the aggregation of the documents' local information as patterns among the documents. This is akin to the emergent macro behavior of a complex system which is distinct from the behavior of any of its component parts. These emergent patterns may often be of more interest to users than the raw facts contained in the documents. Users often approach document collections with interest in themes and issues, how they relate to each other, and what trends they have. Such relations and trends amongst concepts are the emergent information patterns.
An IR system that can, in addition to locating relevant documents,
help users find emergent patterns would provide an added depth
to the information search process. This research has two goals:
First, to demonstrate that pattern discovery support systems can
help users find patterns in a collection. Second, to demonstrate
that patterns can be interesting, non-trivial, and useful in users'
information search process.
In 1988 Don Swanson [6] discovered a relationship between magnesium levels and migraine headaches in MEDLINE. He did so by locating pairs of articles with related titles such as the following [6]:
The relation of migraine and epilepsy.
The magnesium deficient rat as a model of epilepsy.
Swanson's 1988 study cites 128 articles containing 11 different topics, such as epilepsy, linking migraines and magnesium. This is a pattern of information spread out over at least 128 documents that were related in 11 different ways. Such patterns are not easily found.
Remarkably, none of the sixty-five articles on migraine mentions or cites any articles on magnesium and none of the sixty-three articles on magnesium mentions or cites any articles on migraine. Moreover, among 4,600 migraine records and 38,000 magnesium records, there were only six that contained both "migraine" and "magnesium" [which] were principally on magnesium. In short, neither online searching nor printed indexes nor reading the text and following citation trails in medical articles turned up evidence that there was, at the time, any substantial scientific interest in the possibility of a physiological relationship between magnesium and migraine. [6]
Swanson published three such hypotheses of which two have been medically confirmed [6].
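The kind of indirect link described above can be sketched in a few lines. This is only an illustration, not Swanson's actual procedure: it assumes each document is reduced to a set of index terms (a hypothetical representation), and the function name `linking_terms` is ours. It looks for intermediate terms that occur with one topic in some documents and with the other topic in different documents, ignoring the few documents that mention both topics.

```python
from collections import Counter

def linking_terms(docs, a, c):
    """Find terms that co-occur with `a` in some documents and with `c` in others.

    docs: iterable of sets of index terms (assumed representation).
    Only documents mentioning exactly one of `a` and `c` are considered.
    Returns {term: (n_docs_with_a, n_docs_with_c)}.
    """
    with_a, with_c = Counter(), Counter()
    for terms in docs:
        terms = set(terms)
        if a in terms and c not in terms:
            with_a.update(terms - {a})
        elif c in terms and a not in terms:
            with_c.update(terms - {c})
    # Keep only the terms seen on both sides of the gap.
    return {b: (with_a[b], with_c[b]) for b in with_a.keys() & with_c.keys()}

# Toy data modeled on the two titles quoted above.
docs = [
    {"migraine", "epilepsy"},
    {"magnesium", "rat", "epilepsy"},
]
links = linking_terms(docs, "migraine", "magnesium")
```

Here `links` contains the single intermediate term "epilepsy", which is the role it plays in the title pair above. A real collection would need term normalization and some threshold on the counts before a link is worth a user's attention.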
The patterns that Swanson discovered cut across research fields
which were unaware of each other - a situation which is becoming
more common as research specialization increases. A pattern discovery
support system can look for patterns across literatures in a collection
and aid discoveries such as Swanson's. Patterns may also aid users'
information search process by revealing connections between topics
that users had not suspected exist, and by providing a context
for the terms used by the documents in the collection. Patterns
provide a point of interaction with the collection that lets users
see the collection as an entity in itself (cf. [3]).
Homer is a decision support tool that aids users' search for patterns. It grew out of research on thematic queries for The Historic New Orleans Collection museum [6]. The prototype presents users with meta-level document summary statistics based upon a relevance ranking algorithm that is sensitive to semantic latency [4]. Homer displays this meta-information along various dimensions in an interactive, spreadsheet-like display. The prototype accesses a collection of 4,085 titles and captions of photographs by Clarence John Laughlin, which together make over 78,000 references to 2,045 terms from the Art and Architecture Thesaurus (AAT) by the Getty Museum.
Figure 1 shows a snapshot of the Homer screen investigating the term FANTASY. Each column is a different 5-year subdivision of the collection. At the left, in descending order, are the terms most related to FANTASY. Each number in the table is the number of documents in that subdivision that contain both the investigated term and the related term. For example, the subdivision 1935 to 1940 contains 1,464 of the 4,085 documents, of which 215 contain the term FANTASY. Of these 215, 41 contain STONE. Notice that only the numbers in the 1935-1940 column descend monotonically. How terms relate to each other changes from one subdivision to the next, so the ranking in one column may differ from the ranking in another. Users may re-sort the table by any column.
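The counts in such a table are simple to compute, and a minimal sketch may make the display easier to picture. It is not Homer's implementation: the record layout (a year plus a set of terms), the function names, and the way years are bucketed into 5-year subdivisions are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def five_year_period(year, start=1935):
    """Label the 5-year subdivision containing `year` (assumed bucketing)."""
    low = start + 5 * ((year - start) // 5)
    return f"{low}-{low + 5}"

def cooccurrence_table(docs, target, period_of=five_year_period):
    """Per subdivision, count documents containing both `target` and each term.

    docs: iterable of (year, set_of_terms) pairs (hypothetical record layout).
    Returns {subdivision: Counter({term: n_docs})}; the count stored for
    `target` itself is the number of documents containing it.
    """
    table = defaultdict(Counter)
    for year, terms in docs:
        if target in terms:
            table[period_of(year)].update(set(terms))
    return dict(table)

def ranked_terms(table, column):
    """Terms ordered by descending count in one column (ties broken by name)."""
    counts = table.get(column, Counter())
    return [t for t, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

# Toy records, not taken from the Laughlin collection.
docs = [
    (1936, {"FANTASY", "STONE", "ZULU"}),
    (1937, {"FANTASY", "STONE"}),
    (1938, {"STONE"}),                 # no FANTASY, so it is not counted
    (1941, {"FANTASY", "CAST IRON"}),
]
table = cooccurrence_table(docs, "FANTASY")
ranking = ranked_terms(table, "1935-1940")
```

Re-sorting the display by another column corresponds to calling `ranked_terms` with a different subdivision label, which is why the row order can change from one view to the next.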
Figure 2 summarizes three Homer screens regarding FANTASY. Note that, with a few exceptions, the terms most related to FANTASY are construction materials - WROUGHT IRON, CAST IRON, STONE, BRICK. We see here a definite context for the term FANTASY: Laughlin is looking for fantasies within the mundane, physical things around us rather than in the more typical settings used in literature, such as up in the clouds, across the oceans, or into the mountains. One exception to this relationship between FANTASY and materials is the term ZULU. It is the second most related term in the 1935-1940 category, but it vanishes entirely when the table is sorted by another column. Something about "Zulu" caught Laughlin's attention in this period when he was pursuing images of fantasy. We cannot guess here what Laughlin's interest was - it could be Zulu culture, religion, symbolism, or something else - but we now have a story that might be interesting to pursue with a traditional IR system.
Figure 2: Terms most related to FANTASY in three Homer screens, each sorted by a different column (the document counts are not reproduced here).

| Rank | Screen 1 | Screen 2 | Screen 3 |
| --- | --- | --- | --- |
| Investigated term | FANTASY | FANTASY | FANTASY |
| 1 | STONE | STONE | STONE |
| 2 | ZULU | CAST IRON | WROUGHT |
| 3 | WROUGHT | WROUGHT | WROUGHT IRON |
| 4 | WROUGHT IRON | DESIGN | RAISED |
| 5 | CAST IRON | TOMBS | CAST IRON |
| 6 | | | BRICK |
A second example is shown in Figure 3, which summarizes a Homer screen for FANTASTIC ARCHITECTURE. Notice first that all FANTASTIC ARCHITECTURE documents contain the terms AMERICAN and VICTORIAN. From this we can hypothesize that Laughlin was enthralled with American Victorian architecture, and that all his photographs of what he considered to be fantastic architecture were of this particular architectural style. Furthermore, three of the next four terms are building materials. Thus, we may also hypothesize which materials were used for this architectural style. This second hypothesis comes both from the presence of the terms BRICK, CAST IRON, and WOOD and from the absence of such terms as STONE or PLASTER. Finally, we can see that Laughlin took no photos of FANTASTIC ARCHITECTURE before 1935, and, because there are only 7 before 1940, we can infer that he probably did not start on this topic until the very late 1930s.
Although the document collection contains no information about Laughlin himself, we were able to derive three important hypotheses about him. The veracity of these three hypotheses is supported by the following statement Laughlin makes of his architectural photographs: "Among the objectives of this group were to show that the 1880s and 90s were probably the most important period architecturally in American cultural history ([2], p. 155)."
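The first hypothesis rests on a simple test: which terms occur in every document that contains the investigated term? A minimal sketch of that test follows. The function name `universal_terms` and the set-of-terms representation are assumptions for illustration, and the records are toy data rather than entries from the collection.

```python
def universal_terms(docs, target):
    """Return the terms present in every document that contains `target`.

    docs: iterable of sets of index terms (assumed representation).
    """
    matching = [set(terms) for terms in docs if target in terms]
    if not matching:
        return set()
    return set.intersection(*matching) - {target}

# Toy records, not taken from the Laughlin collection.
docs = [
    {"FANTASTIC ARCHITECTURE", "AMERICAN", "VICTORIAN", "BRICK"},
    {"FANTASTIC ARCHITECTURE", "AMERICAN", "VICTORIAN", "WOOD"},
    {"STONE", "AMERICAN"},
]
always_present = universal_terms(docs, "FANTASTIC ARCHITECTURE")
```

In this toy data `always_present` holds AMERICAN and VICTORIAN, the same pattern described for Figure 3. Noticing which expected terms are missing, such as STONE or PLASTER, still relies on the user's judgment.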
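Figure 3: Terms most related to FANTASTIC ARCHITECTURE, in sorted order (the document counts are not reproduced here).

| Rank | Term |
| --- | --- |
| Investigated term | fantastic architecture |
| 1 | american |
| 2 | victorian |
| 3 | brick |
| 4 | northwest |
| 5 | cast iron |
| 6 | wood |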
Patterns provide a holistic view of the entire collection that can help users perform their information search in three ways: they provide a valuable context in which users can evaluate the terms and documents they find; they may stimulate new cognitive connections; and interaction with them helps users discover their own information needs and more precisely specify a proper query to a traditional document retrieval system.
Anecdotal evidence such as the above is compelling, but more rigorous evidence is needed. Two experiments are currently under way. One experiment will apply Homer to MEDLINE in an attempt to recreate Swanson's discoveries. A second experiment will compare user evaluations of Homer with evaluations of a more traditional IR system and of a system consisting of both Homer and the traditional IR system. By randomizing the users, the configurations, and several different information search tasks, we can statistically evaluate such factors as users' confidence in their results.
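One way to picture the randomization in the second experiment is the following sketch. It is an assumption about how such an assignment could be made, not a description of the actual experimental protocol: the configuration labels, the function name `assign`, and the balancing rule are ours.

```python
import random

# The three system configurations being compared (labels are ours).
CONFIGURATIONS = ("Homer", "traditional IR", "Homer + traditional IR")

def assign(users, tasks, seed=0):
    """Randomly assign each user a configuration and a search task.

    Users are shuffled and then dealt to configurations in turn, so the
    group sizes differ by at most one; tasks are drawn uniformly at random.
    Returns a list of (user, configuration, task) triples.
    """
    rng = random.Random(seed)  # fixed seed so the plan can be reproduced
    users = list(users)
    tasks = list(tasks)
    rng.shuffle(users)
    plan = []
    for i, user in enumerate(users):
        config = CONFIGURATIONS[i % len(CONFIGURATIONS)]
        plan.append((user, config, rng.choice(tasks)))
    return plan

plan = assign(["u1", "u2", "u3", "u4", "u5", "u6"], ["task A", "task B"])
```

Dealing shuffled users to configurations in turn keeps the groups balanced, which makes between-configuration comparisons of measures such as user confidence more straightforward.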
Homer was designed only as a proof-of-concept and is quite crude
with only a simple tabular interface. Pattern discovery is a visually
intensive cognitive process and should combine the statistical
search for patterns with proper visual interfaces that help reveal
those patterns. Therefore, future research will explore better
designs using various information visualization techniques - e.g.,
viewing filters [4] or table lens [7] - as front ends for a deliverable
pattern discovery support system.