Abstract
Over the past two years, the CHI conference committee has tried to improve the usability of the conference
proceedings
through improving the index. Latent Semantic Indexing, a statistically-based retrieval method, was used to
analyze the titles
and abstracts of papers and suggest additional relevant keywords not provided by the authors. This poster
describes the
method for generating the indices and shows how it can be used as a general approach for improving access
to paper-based
documents.
Keywords:
Indexing, Information Retrieval, Latent Semantic Analysis, keywords, Paper-based
documents.
Introduction
While the CHI conference focuses on issues of the interactions between humans and computers, it also
addresses issues of
interactions between humans and more primitive information systems, such as conference proceedings.
Over the past two
years, the CHI conference committee has tried to improve the usability of the conference proceedings by
improving the index.
For CHI ‘92 and ‘93, the proceedings had no keyword index. This may be because generating an
index is both a
time-consuming process and difficult to do well. For CHI ‘94 and ‘95, an information retrieval method was
used for
indexing that permitted both automating the indexing and developing a better index to the CHI proceedings.
Human Indexing
The keywords used in a proceedings index are typically the keywords chosen by the authors. However,
people are not very
good at generating good descriptors about their research. Studies of generating keywords show that
different people use the
same term to describe a concept only about 20% of the time [6]. Indeed, even trained indexers seldom
generate the same
keywords for a concept [8]. Thus, authors of conference papers will choose only a small sample of the
words that best
describe the topic of their paper. People searching under a particular term in the index may not pick the
same terms
chosen by the author and therefore not notice the paper.
In addition, indexing is a laborious process. It requires looking through the texts to identify relevant
keywords from the texts
and using knowledge of the domain in order to generate additional keywords not used in the original texts.
In this
research, we applied Latent Semantic Indexing (LSI), a method that models the semantics of the domain, in
order to suggest
additional relevant keywords.
Latent Semantic Indexing
LSI [2,3] is an information retrieval method that models the association between terms and documents
based on how terms
co-occur across documents. The method captures the higher order "latent" structure of word usage across
the documents
rather than just surface level word choice. This permits a characterization of the association between words
and documents,
even if a particular document does not contain those words. The analysis performed by LSI can be
interpreted as a high-
dimensional semantic space, in which terms and documents are represented as vectors in the space. Cosines
between these
vectors represent their predicted similarity.
LSI is therefore able to determine the semantic similarity between any paper and the words related to it,
even if the words
were not used in the original paper. LSI has been used for a variety of applications, including information
retrieval [2],
information filtering [5], and choosing reviewers for CHI and Hypertext conference papers [4].
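
The idea that a word can be associated with a document that does not contain it can be illustrated with a minimal sketch. It assumes a truncated singular value decomposition of a small term-by-document count matrix, which is the decomposition described in [2]. The toy vocabulary, the counts, the choice of k=2, and the function names are all hypothetical and chosen only for illustration. Scaling both term and document coordinates by the singular values is one common convention, not necessarily the one used for the CHI index.

```python
import numpy as np

# Hypothetical toy data: rows are terms, columns are documents.
terms = ["user", "interface", "menu", "speech", "audio"]
A = np.array([
    [1, 1, 0, 0],   # user
    [1, 1, 0, 0],   # interface
    [1, 0, 0, 0],   # menu (absent from document 1)
    [0, 0, 1, 1],   # speech
    [0, 0, 1, 0],   # audio
], dtype=float)


def lsi_space(matrix, k):
    """Return k-dimensional term and document vectors from a truncated SVD."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]       # one row per term
    doc_vecs = Vt[:k, :].T * s[:k]     # one row per document
    return term_vecs, doc_vecs


def cosine(a, b):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


term_vecs, doc_vecs = lsi_space(A, k=2)
menu = term_vecs[terms.index("menu")]

# Document 1 never uses "menu" but shares its other words with document 0.
sim_related = cosine(menu, doc_vecs[1])
# Document 2 is about a different topic.
sim_unrelated = cosine(menu, doc_vecs[2])
print(sim_related, sim_unrelated)
```

In this toy space, "menu" ends up close to document 1 because both co-occur with "user" and "interface", while it stays far from the speech documents.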
METHOD
The text contained in the titles, keywords and abstracts from the CHI conference papers was used for
developing the index.
Because LSI initially does not identify multiple-word keywords, a parts-of-speech tagger [1] identified all
the noun phrases
in the texts. These noun phrases were then pared back by hand to a smaller subset of "relevant" Human-
Computer
Interaction phrases.
In order to generate a large semantic space on HCI-related terms, Perlman's HCIBIB collection [7] was
scaled using LSI.
This resulted in a 300-dimensional semantic space made up of 8530 abstracts by 15998 unique words and
phrases. The text
from each paper to be indexed was then placed in this space based on words used in the text. The closest 50
words and
phrases in the semantic space to each text were then selected, resulting in a list of 2514 unique words and
phrases. This list
was then pared down again by hand to words and phrases most relevant to an HCI index, and highly
semantically similar
words (e.g. GOMS analysis, GOMS modeling) were combined into single entries in the index.
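
The placement and selection steps above can be sketched as follows. This is an illustration under stated assumptions, not the software used for the CHI index: the toy vocabulary and counts are hypothetical, k=2 and n=3 stand in for the 300 dimensions and 50 nearest terms, and a new text is assumed to be placed at the sum of the vectors of the words it contains. The function names are invented for this example.

```python
import numpy as np

# Hypothetical toy corpus: rows are terms, columns are abstracts.
vocab = ["user", "interface", "menu", "speech", "audio", "voice"]
counts = np.array([
    [1, 1, 0, 0],   # user
    [1, 1, 0, 0],   # interface
    [1, 0, 0, 0],   # menu
    [0, 0, 1, 1],   # speech
    [0, 0, 1, 0],   # audio
    [0, 0, 0, 1],   # voice
], dtype=float)


def term_vectors(matrix, k):
    """k-dimensional term vectors from a truncated SVD of the count matrix."""
    U, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * s[:k]


def place_text(words, vocab, term_vecs):
    """Place a new text at the sum of the vectors of its known words."""
    rows = np.array([vocab.index(w) for w in words if w in vocab], dtype=int)
    return term_vecs[rows].sum(axis=0)


def nearest_terms(text_vec, vocab, term_vecs, n):
    """Return the n terms with the highest cosine to the text vector."""
    norms = np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(text_vec)
    sims = (term_vecs @ text_vec) / np.maximum(norms, 1e-12)
    return [vocab[i] for i in np.argsort(-sims)[:n]]


term_vecs = term_vectors(counts, k=2)

# Hypothetical papers to be indexed, reduced to the words in their text.
papers = {"paper_a": ["user", "interface"], "paper_b": ["audio"]}

candidates = {}
for name, words in papers.items():
    text_vec = place_text(words, vocab, term_vecs)
    closest = nearest_terms(text_vec, vocab, term_vecs, n=3)
    # Keep only terms the paper did not already use.
    candidates[name] = [t for t in closest if t not in words]

# Pool the suggestions into one list of unique candidate index terms.
unique_candidates = sorted({t for ts in candidates.values() for t in ts})
print(candidates, unique_candidates)
```

The pooled list corresponds to the candidate list that is then pared down and merged by hand. Those manual steps are not modeled here.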
EVALUATION
For the CHI ‘94 indexing, this method generated many additional words that were not originally suggested
by the authors.
An examination of the results of the indexing illustrates examples of both successes and errors generated by
using this
method. Examples of successes include that the method suggested Psychophysics for a paper
titled "An image
retrieval system considering subjective perception" and suggested Blind users for a paper on
an auditory
enhanced scrollbar. However, there are two types of failures that can also occur using this method: misses
and false alarms.
Misses refer to cases where there were additional appropriate keywords that could have been suggested but
were not
generated by the method. An example of a miss was when the method did not suggest Cognitive
model for a
paper comparing two cognitive architectures, although it did suggest this keyword for other related papers.
The other type of
error, false alarms, refers to cases where the method suggested a keyword that was not appropriate. An
example was that the
method suggested SOAR as a keyword for a paper on cognitive modeling that was not specifically
about the SOAR
model. While false alarms are easier to identify by hand and remove, misses are much more difficult to
identify in general,
since they require human skills of generating additional keywords.
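
The two error types can be expressed as set differences between the keywords the method suggests and the keywords a human indexer judges appropriate. The sketch below is illustrative only: the function name and the keyword lists are hypothetical and are not results from the CHI ‘94 index.

```python
def index_errors(suggested, appropriate):
    """Split keyword suggestions into hits, false alarms, and misses."""
    suggested, appropriate = set(suggested), set(appropriate)
    return {
        "hits": sorted(suggested & appropriate),
        # Suggested by the method but judged not appropriate.
        "false_alarms": sorted(suggested - appropriate),
        # Judged appropriate but never suggested by the method.
        "misses": sorted(appropriate - suggested),
    }


# Hypothetical judgments for a single paper on cognitive modeling.
result = index_errors(
    suggested=["SOAR", "Cognitive architecture"],
    appropriate=["Cognitive model", "Cognitive architecture"],
)
print(result)
```

Computing the false alarms needs only a yes/no judgment on each suggestion. Computing the misses requires someone to supply the full list of appropriate keywords first, which is why misses are harder to identify.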
While no formal evaluation was performed on users of the index, comments from attendees at CHI ‘94
indicated that the index
was found to be useful. Nevertheless, the total time for generating the CHI ‘94 proceedings and conference
companion
indices was approximately 25 man-hours. This would likely be equivalent to the amount of time taken if the
indices had been
created by hand. However, much of this time was spent in the development of the software tools to conduct
the indexing.
As the tools now have been developed, we will be better able to judge the amount of effort required using
this method for
indexing the CHI ‘95 papers. At the time of this writing, the indexing for the CHI ‘95 papers has not been
performed since
the deadline for final papers is in January. In addition to a characterization of the amount of time to index
the proceedings, a
more complete analysis will be performed on the index generated for CHI ‘95. This will include statistics
on how many
additional relevant keywords were suggested and a characterization of the number of false alarms that were
later removed by
hand.
CONCLUSIONS
Potential as a method
LSI captures the semantics of the CHI domain in a manner similar to that of experts in the field. This permits
the method to
suggest relevant keywords not provided by authors. The method is not perfect, generating some misses and
false alarms.
Nevertheless, it eases the burden of the indexer in generating additional keywords. A Macintosh-based
program has been
developed that displays the abstract and title of a paper along with a list of additional computer-generated
suggested
keywords. The interface permits indexers to select words from the screen or type in their own. These
words are then
incorporated into the final index for the proceedings.
Hand vs. Automatic Indexing?
Using LSI for indexing still involves some amount of human processing. Humans must still choose relevant
noun phrases,
select the best of the terms suggested by LSI, and combine highly semantically similar concepts together.
However, this
indexing method automates one of the more difficult and unreliable aspects of indexing, generating
additional relevant
keywords. For the CHI conference, which focuses on issues of usability, it is important to provide easy
access to its
proceedings. LSI appears to be a promising approach to improving the usability of these paper-based
documents.
Acknowledgments
The author thanks Tom Landauer, Susan Dumais, Adrienne Lee, and Steve Abney for advice and help on
the indexing
methods. He also acknowledges the contributions of CHI committee members, Irvin Katz, Rick Gondella,
Catherine Plaisant,
and Beth Adelson for help with the documents.
References
1. Abney, S. A computational model of human parsing. Journal of Psycholinguistic Research, 18(1), (1989),
129-144.
2. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. Indexing by Latent
Semantic Analysis.
Journal of the American Society for Information Science, 41,6, (1990), 391-407.
3. Dumais, S. T., Furnas, G. W., Landauer, T. K., Deerwester, S., & Harshman, R. Using latent semantic
analysis to
improve information retrieval. In CHI ‘88 Conference Proceedings: Human Factors in Computing Systems
(1988), (pp.
281-285). New York: ACM.
4. Dumais, S. T., & Nielsen, J. Automating the assignment of submitted manuscripts to reviewers. In
Proceedings of the
ACM SIGIR ‘92 Conference, Copenhagen, Denmark, (1992).
5. Foltz, P. W. & Dumais, S. T. Personalized information delivery: An analysis of filtering methods.
Communications of the
ACM, 35(12), (1992), 51-60.
6. Furnas, G. W., Landauer T. K., Gomez, L. M. & Dumais, S. T. The Vocabulary problem in Human-
System
Communication. Communications of the ACM, 30,11, (1987), 964-971.
7. Perlman, G. The HCI Bibliography Project. SIGCHI Bulletin, 23,3, (1991), 15-20.
8. Tarr, D. & Borko, H. Factors influencing inter-indexer consistency. In Proceedings of the ASIS 37th
Annual Meeting, Vol.
11, (1974), 50-55.