{"id":1140429,"date":"2025-06-05T09:00:00","date_gmt":"2025-06-05T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1140429"},"modified":"2025-06-17T09:21:38","modified_gmt":"2025-06-17T16:21:38","slug":"benchmarkqed-automated-benchmarking-of-rag-systems","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/benchmarkqed-automated-benchmarking-of-rag-systems\/","title":{"rendered":"BenchmarkQED: Automated benchmarking of RAG systems"},"content":{"rendered":"\n
\"Diagram<\/figure>\n\n\n\n

One of the key use cases for generative AI involves answering questions over private datasets, with retrieval-augmented generation (RAG) as the go-to framework. As new RAG techniques emerge, there’s a growing need to benchmark their performance across diverse datasets and metrics.

To meet this need, we’re introducing BenchmarkQED, a new suite of tools that automates RAG benchmarking at scale, available on GitHub. It includes components for query generation, evaluation, and dataset preparation, each designed to support rigorous, reproducible testing.

BenchmarkQED complements the RAG methods in our open-source GraphRAG library, enabling users to run a GraphRAG-style evaluation across models, metrics, and datasets. GraphRAG uses a large language model (LLM) to generate and summarize entity-based knowledge graphs, producing more comprehensive and diverse answers than standard RAG for large-scale tasks.

In this post, we walk through the core components of BenchmarkQED that contribute to the overall benchmarking process. We also share some of the latest benchmark results comparing our LazyGraphRAG system to competing methods, including a vector-based RAG with a 1M-token context window, where the leading LazyGraphRAG configuration showed significant win rates across all combinations of quality metrics and query classes.

In the accompanying paper, we distinguish between local queries, where answers are found in a small number of text regions, sometimes even a single region, and global queries, which require reasoning over large portions of the dataset, or even the dataset as a whole.

Conventional vector-based RAG excels at local queries because the regions containing the answer resemble the query itself and can be retrieved as nearest neighbors in the vector space of text embeddings. However, it struggles with global questions, such as “What are the main themes of the dataset?”, which require understanding dataset qualities not explicitly stated in the text.
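To make the local case concrete, here is a minimal sketch of the nearest-neighbor retrieval step described above, operating on precomputed chunk embeddings. The function name and inputs are hypothetical stand-ins for illustration, not part of BenchmarkQED:

```python
import numpy as np

def retrieve_local(query_emb: np.ndarray,
                   chunk_embs: np.ndarray,
                   chunks: list[str],
                   k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are nearest to the query
    by cosine similarity (the standard vector-RAG retrieval step)."""
    # Normalize vectors so that dot products equal cosine similarities.
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = c @ q
    top_k = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_k]
```

This works when answer-bearing chunks resemble the query. A global query such as “What are the main themes of the dataset?” has no such nearby neighbors, which is exactly the gap that motivates the query classes below.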

AutoQ: Automated query synthesis

This limitation motivated the development of GraphRAG, a system designed to answer global queries. GraphRAG’s evaluation requirements subsequently led to the creation of AutoQ, a method for synthesizing these global queries for any dataset.

AutoQ extends this approach by generating synthetic queries across the full local-global spectrum. It defines four distinct classes based on the source and scope of the query (Figure 1, top), forming a logical progression along the spectrum (Figure 1, bottom).

\"Diagram
Figure 1. Construction of a 2×2 design space for synthetic query generation with AutoQ, showing how the four resulting query classes map onto the local-global query spectrum. <\/figcaption><\/figure>\n\n\n\n
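One way to picture the design space is as the product of the two axes in Figure 1. The sketch below assumes the source axis distinguishes data-driven from activity-driven queries, per the AutoQ framing; the identifiers are illustrative, not BenchmarkQED names:

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product

class Source(Enum):
    DATA = "data"          # query drawn from dataset content (assumed axis value)
    ACTIVITY = "activity"  # query drawn from hypothesized user activity (assumed)

class Scope(Enum):
    LOCAL = "local"    # answerable from a few text regions
    GLOBAL = "global"  # requires reasoning over much or all of the dataset

@dataclass(frozen=True)
class QueryClass:
    source: Source
    scope: Scope

# The four classes forming the 2x2 design space of Figure 1.
QUERY_CLASSES = [QueryClass(s, sc) for s, sc in product(Source, Scope)]
```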

AutoQ can be configured to generate any number and distribution of synthetic queries across these classes, enabling consistent benchmarking across datasets without requiring user customization. Figure 2 shows the synthesis process and sample queries from each class, using an AP News dataset.
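As a sketch of what “any number and distribution” might mean in practice, the self-contained helper below splits a query budget across the four classes by weight. It is an illustration under assumed class names, not BenchmarkQED’s actual interface:

```python
def allocate_query_budget(total: int, weights: dict[str, float]) -> dict[str, int]:
    """Split a total query budget across classes in proportion to weights,
    giving rounding leftovers to the highest-weight classes first."""
    norm = sum(weights.values())
    counts = {c: int(total * w / norm) for c, w in weights.items()}
    leftover = total - sum(counts.values())
    for c in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        counts[c] += 1
    return counts

# Example: 100 synthetic queries, uniform across the four query classes
# (class names here are illustrative, matching the Figure 1 axes).
print(allocate_query_budget(100, {
    "data_local": 1.0, "data_global": 1.0,
    "activity_local": 1.0, "activity_global": 1.0,
}))
```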

\"Diagram
Figure 2. Synthesis process and example query for each of the four AutoQ query classes. <\/figcaption><\/figure>\n\n\n\n\t

AutoE: Automated evaluation framework

Our evaluation of GraphRAG focused on analyzing key qualities of answers to global questions. The following qualities were used for the current evaluation: