About
Bhaskar Mitra is a Principal Researcher at Microsoft Research based in Montreal, Canada. Bhaskar’s research focuses on AI-mediated information and knowledge access. His research interests span model and system development, evaluation and benchmarking, and fairness and ethics in the context of these sociotechnical systems. Before joining Microsoft Research, he worked at Bing for 15 years conducting research with a strong focus on both academic and product impact. Bhaskar serves as the ACM SIGIR Community Relations Coordinator, as an Associate Editor for the ACM Transactions on Information Systems (TOIS) journal, and on the NIST TREC program committee. Bhaskar is the recipient of two ACM SIGIR 2024 Early Career Researcher Awards for excellence in Research and for excellence in Community Engagement. He co-organized the Neural IR Workshops (Neu-IR’16 and Neu-IR’17) to bring together an early community of information retrieval researchers interested in deep learning methods, as well as several shared evaluation tasks and community benchmarking efforts including the MS MARCO ranking leaderboards, the TREC Deep Learning Track (2019-2023), and the TREC Tip-of-the-Tongue Track (2023-). He received his Ph.D. in Computer Science from University College London under the supervision of Dr. Emine Yilmaz.
Featured Items
Download: MS MARCO
MS MARCO is a collection of datasets focused on deep learning in search. The first dataset was a question answering dataset featuring 100,000 real Bing questions, each with a human-generated answer. Since then we have released a 1,000,000 question dataset, a natural language generation dataset, a passage ranking dataset, a keyphrase extraction dataset, a crawling dataset, and a conversational search dataset.
TREC Track: Deep Learning
The TREC Deep Learning Track studies information retrieval in a large training data regime. This is the case where the number of training queries with at least one positive label is at least in the tens of thousands, if not hundreds of thousands or more. This corresponds to real-world scenarios such as training based on click logs and training based on labels from shallow pools (such as the pooling in the TREC Million Query Track or the evaluation of search engines based on early precision).
Book: An Introduction to Neural Information Retrieval
Neural models have been employed in many Information Retrieval scenarios, including ad-hoc retrieval, recommender systems, multi-media search, and even conversational systems that generate answers in response to natural language questions. An Introduction to Neural Information Retrieval provides a tutorial introduction to neural methods for ranking documents in response to a query, an important IR task. The monograph provides a complete picture of neural information retrieval techniques that culminate in supervised neural learning to rank models including deep neural network architectures that are trained end-to-end for ranking tasks. In reaching this point, the authors cover all the important topics, including the learning to rank framework and an overview of deep neural networks. This monograph provides an accessible, yet comprehensive, overview of the state-of-the-art of Neural Information Retrieval.
TREC Track: Tip-of-the-Tongue
Tip-of-the-tongue (ToT) known-item retrieval is defined as "an item identification task in which the searcher has previously experienced an item but cannot recall a reliable identifier" (i.e., "It’s on the tip of my tongue…"). The TREC ToT track aims to develop IR systems that can successfully resolve ToT information needs. Progress in this area will likely benefit other IR systems that must deal with memory assistance, such as personal information management (PIM) systems (e.g., email re-finding).
PhD Thesis: Neural Methods for Effective, Efficient, and Exposure-Aware Information Retrieval
Neural networks with deep architectures have demonstrated significant performance improvements in computer vision, speech recognition, and natural language processing. The challenges in information retrieval (IR), however, are different from these other application areas. A common form of IR involves ranking of documents (or short passages) in response to keyword-based queries. Effective IR systems must deal with the query-document vocabulary mismatch problem by modeling relationships between different query and document terms and how they indicate relevance. Models should also consider lexical matches, both to handle queries containing rare terms not seen during training (such as a person's name or a product model number) and to avoid retrieving semantically related but irrelevant results. In many real-life IR tasks, the retrieval involves extremely large collections (such as the document index of a commercial Web search engine) containing billions of documents. Efficient IR methods should take advantage of specialized IR data structures, such as the inverted index, to efficiently retrieve from large collections. Given an information need, the IR system also mediates how much exposure an information artifact receives by deciding whether it should be displayed, and where it should be positioned, among other results. Exposure-aware IR systems may optimize for additional objectives, besides relevance, such as parity of exposure for retrieved items and content publishers. In this thesis, we present novel neural architectures and methods motivated by the specific needs and challenges of IR tasks.