{"id":1140429,"date":"2025-06-05T09:00:00","date_gmt":"2025-06-05T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1140429"},"modified":"2025-06-17T09:21:38","modified_gmt":"2025-06-17T16:21:38","slug":"benchmarkqed-automated-benchmarking-of-rag-systems","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/benchmarkqed-automated-benchmarking-of-rag-systems\/","title":{"rendered":"BenchmarkQED: Automated benchmarking of RAG systems"},"content":{"rendered":"\n
\"Diagram<\/figure>\n\n\n\n

One of the key use cases for generative AI involves answering questions over private datasets, with retrieval-augmented generation (RAG) as the go-to framework. As new RAG techniques emerge, there’s a growing need to benchmark their performance across diverse datasets and metrics.

To meet this need, we’re introducing BenchmarkQED, a new suite of tools that automates RAG benchmarking at scale, available on GitHub. It includes components for query generation, evaluation, and dataset preparation, each designed to support rigorous, reproducible testing.

BenchmarkQED complements the RAG methods in our open-source GraphRAG library, enabling users to run a GraphRAG-style evaluation across models, metrics, and datasets. GraphRAG uses a large language model (LLM) to generate and summarize entity-based knowledge graphs, producing more comprehensive and diverse answers than standard RAG for large-scale tasks.

In this post, we walk through the core components of BenchmarkQED that contribute to the overall benchmarking process. We also share some of the latest benchmark results comparing our LazyGraphRAG system to competing methods, including a vector-based RAG with a 1M-token context window, where the leading LazyGraphRAG configuration showed significant win rates across all combinations of quality metrics and query classes.

In the accompanying paper, we distinguish between local queries, where answers are found in a small number of text regions, sometimes even a single region, and global queries, which require reasoning over large portions of the dataset, or even the dataset as a whole.

Conventional vector-based RAG excels at local queries because the regions containing the answer resemble the query itself and can be retrieved as nearest neighbors in the vector space of text embeddings. However, it struggles with global questions, such as “What are the main themes of the dataset?”, which require understanding dataset qualities not explicitly stated in the text.
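To make the local case concrete, here is a minimal sketch of the nearest-neighbor retrieval step described above, operating on precomputed chunk embeddings. The function name and inputs are hypothetical stand-ins for illustration, not part of BenchmarkQED:

```python
import numpy as np

def retrieve_local(query_emb: np.ndarray,
                   chunk_embs: np.ndarray,
                   chunks: list[str],
                   k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are nearest to the query
    by cosine similarity (the standard vector-RAG retrieval step)."""
    # Normalize vectors so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q
    top_k = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_k]
```

This works when answer-bearing chunks resemble the query. A global query such as “What are the main themes of the dataset?” has no such nearby neighbors, which is exactly the gap that motivates the query classes below.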

AutoQ: Automated query synthesis

This limitation motivated the development of GraphRAG, a system designed to answer global queries. GraphRAG’s evaluation requirements subsequently led to the creation of AutoQ, a method for synthesizing these global queries for any dataset.

AutoQ extends this approach by generating synthetic queries across the full local-global spectrum. It defines four distinct classes based on the source and scope of the query (Figure 1, top), forming a logical progression along the spectrum (Figure 1, bottom).

\"Diagram
Figure 1. Construction of a 2×2 design space for synthetic query generation with AutoQ, showing how the four resulting query classes map onto the local-global query spectrum. <\/figcaption><\/figure>\n\n\n\n
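One way to picture the design space is as the product of the two axes in Figure 1. The sketch below assumes the source axis distinguishes data-driven from activity-driven queries, per the AutoQ framing; the identifiers are illustrative, not BenchmarkQED names:

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product

class Source(Enum):
    DATA = "data"          # query drawn from dataset content (assumed axis value)
    ACTIVITY = "activity"  # query drawn from hypothesized user activity (assumed)

class Scope(Enum):
    LOCAL = "local"    # answerable from a few text regions
    GLOBAL = "global"  # requires reasoning over much or all of the dataset

@dataclass(frozen=True)
class QueryClass:
    source: Source
    scope: Scope

# The four classes forming the 2x2 design space of Figure 1.
QUERY_CLASSES = [QueryClass(s, sc) for s, sc in product(Source, Scope)]
```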

AutoQ can be configured to generate any number and distribution of synthetic queries across these classes, enabling consistent benchmarking across datasets without requiring user customization. Figure 2 shows the synthesis process and sample queries from each class, using an AP News dataset.
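As a sketch of what “any number and distribution” might mean in practice, the self-contained helper below splits a query budget across the four classes by weight. It is an illustration under assumed class names, not BenchmarkQED’s actual interface:

```python
def allocate_query_budget(total: int, weights: dict[str, float]) -> dict[str, int]:
    """Split a total query budget across classes in proportion to weights,
    giving rounding leftovers to the highest-weight classes first."""
    norm = sum(weights.values())
    counts = {c: int(total * w / norm) for c, w in weights.items()}
    leftover = total - sum(counts.values())
    for c in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        counts[c] += 1
    return counts

# Example: 100 synthetic queries, uniform across the four query classes
# (class names here are illustrative, matching the Figure 1 axes).
print(allocate_query_budget(100, {
    "data_local": 1.0, "data_global": 1.0,
    "activity_local": 1.0, "activity_global": 1.0,
}))
```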

\"Diagram
Figure 2. Synthesis process and example query for each of the four AutoQ query classes. <\/figcaption><\/figure>\n\n\n\n\t

AutoE: Automated evaluation framework

Our evaluation of GraphRAG focused on analyzing key qualities of answers to global questions. The following qualities were used for the current evaluation: