Controlling Overlap in Content-Oriented XML Retrieval

The representation of documents in XML provides an opportunity for information retrieval systems to take advantage of document structure, returning individual document components when appropriate, rather than complete documents in all circumstances. In response to a user query, an XML information retrieval system might return a mixture of paragraphs, sections, articles, bibliographic entries and other components. This facility is of particular benefit when a collection contains very long documents, such as product manuals or books, where the user should be directed to the most relevant portions of these documents.

The direct application of standard ranking techniques to retrieve individual elements from a collection of XML documents often produces a result set in which the top ranks are dominated by a large number of elements taken from a small number of highly relevant documents. This paper presents and evaluates an algorithm that re-ranks this result set, with the aim of minimizing redundant content while preserving the benefits of element retrieval, including the benefit of identifying topic-focused components contained within relevant documents. Test collections developed by the INitiative for the Evaluation of XML Retrieval (INEX) form the basis for the evaluation.
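To make the overlap problem concrete, the following is a minimal illustrative sketch of one simple way a ranked list of elements could be re-ordered to reduce redundancy. It is not the algorithm presented in the talk, whose details are not given in this abstract. The sketch assumes each element is identified by a path of labels from the document root, so that two elements overlap when one path is a prefix of the other. The names `overlaps` and `rerank`, the `penalty` parameter, and the example scores are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: an element is identified by its path from the document root,
# e.g. ("d1", "article", "sec1", "p1"). This representation is illustrative only.
Path = Tuple[str, ...]


def overlaps(a: Path, b: Path) -> bool:
    """Return True if one element contains the other (one path is a prefix of the other)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    return longer[:len(shorter)] == shorter


def rerank(scores: Dict[Path, float], penalty: float = 0.5) -> List[Path]:
    """Greedy re-ranking sketch (hypothetical, not the talk's method).

    Repeatedly report the highest-scoring remaining element, then multiply the
    score of every remaining element that overlaps it by `penalty`, so that
    ancestors and descendants of already-reported content drop in the ranking.
    """
    remaining = dict(scores)  # work on a copy; the caller's scores are untouched
    ranked: List[Path] = []
    while remaining:
        # Ties are broken by path so the output is deterministic.
        best = max(remaining, key=lambda p: (remaining[p], p))
        ranked.append(best)
        del remaining[best]
        for path in remaining:
            if overlaps(path, best):
                remaining[path] *= penalty
    return ranked


if __name__ == "__main__":
    example_scores = {
        ("d1", "article"): 0.9,
        ("d1", "article", "sec1"): 0.8,
        ("d1", "article", "sec1", "p1"): 0.7,
        ("d2", "article"): 0.6,
    }
    for element in rerank(example_scores):
        print("/".join(element))
```

With these made-up scores, a plain sort by score would place all three elements from document d1 ahead of d2. The sketch instead reports the d1 article, then the d2 article, and only afterwards the nested section and paragraph from d1. A penalty of 1.0 reproduces the plain sort, while a penalty of 0.0 pushes every overlapping element to the bottom of the list.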

Speaker Details

Charlie Clarke is an Associate Professor in the School of Computer Science at the University of Waterloo. His research interests include information storage and retrieval, software development tools, and programming language implementation. Charlie received his Ph.D. from Waterloo in 1996. From 1996 to 1999 he was an Assistant Professor in the Department of Electrical and Computer Engineering at the University of Toronto. He has previously held software development positions at a number of computer consulting and engineering firms.

Speakers: Charlie Clarke
Affiliation: University of Waterloo