{"id":184391,"date":"2004-09-16T00:00:00","date_gmt":"2009-10-31T13:42:21","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/msr-research-item\/evaluating-retrieval-system-effectiveness\/"},"modified":"2016-09-09T09:59:09","modified_gmt":"2016-09-09T16:59:09","slug":"evaluating-retrieval-system-effectiveness","status":"publish","type":"msr-video","link":"https:\/\/www.microsoft.com\/en-us\/research\/video\/evaluating-retrieval-system-effectiveness\/","title":{"rendered":"Evaluating Retrieval System Effectiveness"},"content":{"rendered":"
\n

One of the primary motivations for the Text REtrieval Conference (TREC) was to standardize retrieval system evaluation. While the Cranfield paradigm of using test collections to compare system output had been introduced decades before the start of TREC, the particulars of how it was implemented differed across researchers, making evaluation results incomparable. The validity of test collections as a research tool was in question, not only from those who objected to the reliance on relevance judgments, but also from those who were concerned about how such collections could scale. With the notable exception of Sparck Jones and van Rijsbergen’s report on the need for larger, better test collections, there was little explicit discussion of what constituted a minimally acceptable experimental design, and no hard evidence to support any position.<\/p>\n

TREC has succeeded in standardizing and validating the use of test collections as a retrieval research tool. The repository of runs submitted to TREC against common collections has enabled the empirical determination of how much confidence can be placed in a conclusion that one system is better than another under a given experimental design. In particular, the reliability of such a conclusion has been shown to depend critically on both the evaluation measure and the number of questions used in the experiment.<\/p>\n

This talk summarizes the results of two more recent investigations based on the TREC data: the definition of a new measure, and evaluation methodologies that look beyond average effectiveness.
\nThe new measure, named “bpref” for binary preferences, is as stable as existing measures, but is much more robust in the face of incomplete relevance judgments, so it can be used in environments where complete judgments are not possible.
\nUsing average effectiveness scores hampers failure analysis because averages hide an enormous amount of variance, yet more focused evaluations are unstable precisely because of that variance.<\/p>\n<\/div>\n

<\/p>\n","protected":false},"excerpt":{"rendered":"

One of the primary motivations for the Text REtrieval Conference (TREC) was to standardize retrieval system evaluation. While the Cranfield paradigm of using test collections to compare system output had been introduced decades before the start of TREC, the particulars of how it was implemented differed across researchers, making evaluation results incomparable. The validity of […]<\/p>\n","protected":false},"featured_media":290480,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"research-area":[],"msr-video-type":[],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-184391","msr-video","type-msr-video","status-publish","has-post-thumbnail","hentry","msr-locale-en_us"],"msr_download_urls":"","msr_external_url":"https:\/\/youtu.be\/Tw4guy9X8U0","msr_secondary_video_url":"","msr_video_file":"","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-video\/184391"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-video"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-video"}],"version-history":[{"count":0,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-video\/184391\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/290480"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=184391"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=184391"},{"taxonomy":"msr-video-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-video-type?post=184391"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/w
ww.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=184391"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=184391"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=184391"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}