Selection Bias in the LETOR Datasets

Tom Minka

August 2008

The LETOR datasets consist of data extracted from traditional IR test corpora. For each of a number of test topics, a set of documents has been extracted, in the form of features of each document-query pair, for use by a ranker. An examination of the ways in which documents were selected for each topic shows that the selection has (for each of the three corpora) a particular bias or skewness. This has some unexpected effects that may considerably influence any learning-to-rank exercise conducted on these datasets. The problems may be resolvable by modifying the datasets.