{"id":1051977,"date":"2024-07-02T06:37:44","date_gmt":"2024-07-02T13:37:44","guid":{"rendered":""},"modified":"2024-07-02T06:37:46","modified_gmt":"2024-07-02T13:37:46","slug":"graphrag-new-tool-for-complex-data-discovery-now-on-github","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/graphrag-new-tool-for-complex-data-discovery-now-on-github\/","title":{"rendered":"GraphRAG: New tool for complex data discovery now on GitHub"},"content":{"rendered":"\n
\"GraphRAG<\/figure>\n\n\n\n

Earlier this year, we introduced GraphRAG<\/a>, a graph-based approach to retrieval-augmented generation (RAG) that enables question-answering over private or previously unseen datasets. Today, we\u2019re pleased to announce that GraphRAG is now available on GitHub (opens in new tab)<\/span><\/a>, offering more structured information retrieval and comprehensive response generation than naive RAG approaches. The GraphRAG code repository is complemented by a solution accelerator (opens in new tab)<\/span><\/a>, providing an easy-to-use API experience hosted on Azure that can be deployed code-free in a few clicks. <\/p>\n\n\n\n

GraphRAG uses a large language model (LLM) to automate the extraction of a rich knowledge graph from any collection of text documents. One of the most exciting features of this graph-based data index is its ability to report on the semantic structure of the data prior to any user queries. It does this by detecting \u201ccommunities\u201d of densely connected nodes in a hierarchical fashion, partitioning the graph at multiple levels from high-level themes to low-level topics, as illustrated in Figure 1. Using an LLM to summarize each of these communities creates a hierarchical summary of the data, providing an overview of a dataset without needing to know which questions to ask in advance. Each community serves as the basis of a community summary<\/em> that describes its entities and their relationships.<\/p>\n\n\n\n
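The hierarchical community detection described above can be sketched in a few lines. GraphRAG's actual pipeline differs (and the entity graph is extracted by an LLM, not hand-built); this illustration uses networkx's Louvain implementation as a readily available stand-in, with a toy entity graph and names invented for the example.

```python
# Illustrative sketch: partition an entity graph into communities at two
# levels; in GraphRAG, an LLM would then summarize each community.
import networkx as nx
from networkx.algorithms.community import louvain_communities

# Toy entity graph: nodes are extracted entities, edges are relationships.
g = nx.Graph()
g.add_edges_from([
    ('NovaCorp', 'Ada Park'), ('NovaCorp', 'Lina Wu'), ('Ada Park', 'Lina Wu'),
    ('Harbor FC', 'Sam Cole'), ('Harbor FC', 'Joe Reyes'), ('Sam Cole', 'Joe Reyes'),
    ('NovaCorp', 'Harbor FC'),  # weak bridge between the two themes
])

# Level 0: coarse communities (high-level themes).
level0 = louvain_communities(g, seed=42)

# Level 1: re-partition each level-0 community into finer topics.
level1 = []
for community in level0:
    level1.extend(louvain_communities(g.subgraph(community), seed=42))
```

Summarizing each set in `level0` and `level1` with an LLM yields the hierarchical community summaries shown in Figure 1.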

\"Figure
Figure 1. Knowledge graph of entity nodes and relationship edges derived from a news dataset<\/a>, with different colors representing various communities. Level 0 communities (left) represent the highest-level themes of the dataset, while level 1 communities (right) show the emergence of more granular topics within these themes.<\/figcaption><\/figure>\n\n\n\n

Advantages of community summaries for \u201cglobal questions\u201d<\/h2>\n\n\n\n

In a recent preprint<\/a>, we explore how these community summaries can also help answer global questions<\/em>\u2014which address the entire dataset rather than focusing on specific chunks of text\u2014where naive RAG approaches based on vector search fall short. For example, consider the question \u201cWhat are the main themes in the dataset?\u201d This is a reasonable starting point but one where naive RAG will always give misleading answers. This is because it generates answers from chunks of text semantically similar to the question, not necessarily from the subset of input texts needed to answer it.<\/p>\n\n\n\n

However, if a question addresses the entire dataset, all<\/em> input texts should be considered. Since naive RAG only considers the top-k <\/em>most similar chunks of input text, it fails. Even worse, it will match the question against chunks of text that are superficially similar to that question, resulting in misleading answers. Community summaries help answer such global questions because the graph index of entity and relationship descriptions has already considered all input texts in its construction. Therefore, we can use a map-reduce approach for question answering that retains all relevant content from the global data context:<\/p>\n\n\n\n

    \n
  1. Group community reports up to the LLM context window size. <\/li>\n\n\n\n
  2. Map the question across each group to create community answers. <\/li>\n\n\n\n
  3. Reduce all relevant community answers into a final global answer.  <\/li>\n<\/ol>\n\n\n\n
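The three steps above can be sketched as follows. The `ask_llm` callable is a hypothetical stand-in for a real LLM call; only the grouping logic (step 1) is shown concretely, with a simple whitespace token count standing in for a real tokenizer.

```python
# Minimal sketch of the map-reduce query steps, under the assumptions above.
def group_reports(reports, max_tokens, count_tokens=lambda s: len(s.split())):
    # Step 1: pack community reports into groups that fit the context window.
    groups, current, used = [], [], 0
    for report in reports:
        cost = count_tokens(report)
        if current and used + cost > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(report)
        used += cost
    if current:
        groups.append(current)
    return groups

def global_answer(question, reports, ask_llm, max_tokens=8000):
    groups = group_reports(reports, max_tokens)
    # Step 2 (map): answer the question against each group of reports.
    partials = [ask_llm(question, '\n'.join(g)) for g in groups]
    # Step 3 (reduce): combine the partial answers into one global answer.
    return ask_llm(question, '\n'.join(partials))
```

Because every community report is considered in some map step, relevant content from the whole dataset can survive into the final reduce.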

Evaluation and results<\/h2>\n\n\n\n

    To evaluate this approach against naive RAG and hierarchical source-text summarization, we used the LLM GPT-4 to generate a diverse set of activity-centered sense-making questions from short descriptions of two datasets: podcast transcripts and news articles. We then selected three metrics for head-to-head comparison of generated answers, as evaluated by an LLM judge: comprehensiveness (covers all aspects in detail), diversity (provides different perspectives), and empowerment (supports informed decision making).<\/p>\n\n\n\n
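A pairwise, LLM-judged comparison of this kind can be sketched as follows. The `judge` callable is a hypothetical LLM that replies '1' or '2'; the metric descriptions are taken from the post, but the prompt wording is invented for illustration.

```python
# Sketch of a pairwise LLM-judged comparison, under the assumptions above.
METRICS = {
    'comprehensiveness': 'covers all aspects of the question in detail',
    'diversity': 'provides different perspectives',
    'empowerment': 'supports informed decision making',
}

def compare(question, answer1, answer2, metric, judge):
    prompt = (
        'Metric: ' + metric + ' (' + METRICS[metric] + ')\n'
        'Question: ' + question + '\n'
        'Answer 1: ' + answer1 + '\n'
        'Answer 2: ' + answer2 + '\n'
        'Which answer is better on this metric? Reply 1 or 2.'
    )
    return judge(prompt)

def win_rate(pairs, metric, judge):
    # Fraction of (question, answer1, answer2) triples where answer 1 wins.
    wins = sum(1 for q, a1, a2 in pairs if compare(q, a1, a2, metric, judge) == '1')
    return wins / len(pairs)
```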

    The results show that GraphRAG, when using community summaries at any level of the community hierarchy, outperforms naive RAG on comprehensiveness and diversity (~70\u201380% win rate). GraphRAG using intermediate- and low-level community summaries also performed better than source text summarization on these metrics at lower token costs (~20\u201370% token use per query). Performance was competitive with hierarchical source text summarization for the highest-level communities at substantially lower token costs (~2\u20133% token use per query). This is shown in Figure 2.<\/p>\n\n\n\n

    \"Question:. These individuals have been highlighted for various reasons, ranging from their professional achievements to their personal lives. [...] [...] [...] [...] These figures are frequently covered due to their high-profile status and the public\u2019s interest in their careers and personal lives. Their activities, whether in music, sports, or personal relationships, have significant cultural and economic impacts, as evidenced by the media coverage and public reactions. GraphRAG response: The entertainment industry is vast and diverse, encompassing film, television, music, sports, and digital media. Certain public figures stand out due to their significant contributions and influence across these sectors. The following summary highlights key individuals who are repeatedly mentioned in various entertainment articles, reflecting their impact and presence within the industry. Actors and Directors [...20 figures...] Public Figures in Controversy [...3 figures...] Musicians and Executives [...5 figures...] Athletes and Coaches [...7 figures...] Influencers and Entrepreneurs [...3 figures...] The repeated mention of these figures in entertainment articles signifies their ongoing relevance and the public\u2019s interest in their work. Their influence spans across various aspects of entertainment, from shaping cultural narratives in film and television to driving trends in music and digital media. These individuals not only contribute to their respective fields but also influence the broader cultural landscape, often becoming central figures in social discussions and public discourse. LLM evaluation: Comprehensiveness: Winner = GraphRAG; Diversity: Winner = GraphRAG; Empowerment: Winner = GraphRAG. \" class=\"wp-image-1052745\"\/>
    Figure 2. Comparison of naive RAG and GraphRAG responses to a global question about a news dataset<\/a> indicates that GraphRAG outperformed naive RAG in terms of comprehensiveness, diversity, and empowerment.<\/figcaption><\/figure>\n\n\n\n\t

    Research insights and future directions <\/h2>\n\n\n\n

    Through the initial research cycle, we demonstrated that LLMs can successfully derive rich knowledge graphs from unstructured text inputs, and these graphs can support a new class of global queries for which (a) naive RAG cannot generate appropriate responses, and (b) hierarchical source text summarization is prohibitively expensive per query. The overall suitability of GraphRAG for any given use case, however, depends on whether the benefits of structured knowledge representations, ready-made community summaries, and support for global queries outweigh the upfront costs of graph index construction.<\/p>\n\n\n\n

    We\u2019re currently exploring various approaches to reduce these costs while maintaining response quality. Our latest work on automatically tuning LLM extraction prompts to the problem domain is an example of how we are reducing the upfront effort required to customize these prompts, enumerate entity types, create few-shot examples, and so on. To enable evaluation of GraphRAG with minimal upfront indexing costs, we\u2019re also investigating NLP-based approaches to approximating the knowledge graph and community summaries that would be generated by a full indexing process. Our goal is to ensure that, whatever the constraints of the deployment context, there is a GraphRAG configuration that can accommodate these constraints while still delivering exceptional response quality.<\/p>\n\n\n\n
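As a flavor of what an LLM-free approximation of the entity graph might look like, the sketch below links capitalized tokens that co-occur within a sentence, weighting edges by co-occurrence count. This heuristic is purely illustrative, not the approach used in GraphRAG or under investigation here.

```python
# Hypothetical sketch of a cheap, LLM-free entity-graph approximation:
# treat capitalized tokens as entity mentions and link entities that
# co-occur in the same sentence.
import re
from collections import Counter

def approximate_graph(text):
    edges = Counter()
    for sentence in re.split(r'[.!?]+', text):
        words = sentence.split()
        # Naive entity heuristic: capitalized words not at sentence start.
        entities = sorted({w.strip(',') for w in words[1:] if w[:1].isupper()})
        for i, a in enumerate(entities):
            for b in entities[i + 1:]:
                edges[(a, b)] += 1  # weight = co-occurrence count
    return edges

edges = approximate_graph('Ada joined NovaCorp. Later Ada met Lina at NovaCorp.')
```

The resulting weighted edge list could then feed the same community-detection and summarization stages as a full LLM-built index, at a fraction of the indexing cost.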


    By making GraphRAG and a solution accelerator (opens in new tab)<\/span><\/a> publicly available, we aim to make graph-based RAG approaches more accessible for users and use cases where it\u2019s critical to understand data at a global level. We encourage community feedback and suggestions on both the code repository and solution accelerator as we work together to enable the next generation of RAG experiences.<\/p>\n\n\n\n

    Acknowledgements<\/h2>\n\n\n\n

    Joshua Bradley<\/a>, Christine Caggiano, M\u00f3nica Carvajal, Alex Chao<\/a>, Newman Cheng, Ed Clark, Ben Cutler<\/a>, Andres Morales Esquivel<\/a>, Nathan Evans<\/a>, Alonso Guevara Fern\u00e1ndez<\/a>, Amber Hoak<\/a>, Kate Lytvynets<\/a>, Gaudy Blanco Meneses<\/a>, Apurva Mody<\/a>, Robert Ness<\/a>, Gabriel Nieves-Ponce, Douglas Orbaker, Richard Ortega, Rodrigo Racanicci, Billie Rinaldi, Katy Smith, Sarah Smith<\/a>, Shane Solomon, Dayenne Souza<\/a>, David Tittsworth<\/a>, Chris Trevino<\/a>, Derek Worthen<\/p>\n","protected":false},"excerpt":{"rendered":"

    GraphRAG, a graph-based approach to retrieval-augmented generation (RAG) that significantly improves question-answering over private or previously unseen datasets, is now available on GitHub.<\/p>\n","protected":false},"author":42735,"featured_media":1052106,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Darren Edge","user_id":"31509"},{"type":"user_nicename","value":"Ha Trinh","user_id":"43245"},{"type":"user_nicename","value":"Steven Truitt","user_id":"43143"},{"type":"user_nicename","value":"Jonathan Larson","user_id":"32385"}],"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[243984],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-1051977","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-blog-homepage-featured"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[901101],"related-projects":[1027041],"related-events":[],"related-researchers":[{"type":"user_nicename","value":"Darren Edge","user_id":31509,"display_name":"Darren Edge","author_link":"Darren Edge<\/a>","is_active":false,"last_first":"Edge, Darren","people_section":0,"alias":"daedge"},{"type":"user_nicename","value":"Ha Trinh","user_id":43245,"display_name":"Ha Trinh","author_link":"Ha Trinh<\/a>","is_active":false,"last_first":"Trinh, 
Ha","people_section":0,"alias":"trinhha"},{"type":"user_nicename","value":"Steven Truitt","user_id":43143,"display_name":"Steven Truitt","author_link":"Steven Truitt<\/a>","is_active":false,"last_first":"Truitt, Steven","people_section":0,"alias":"steventruitt"},{"type":"user_nicename","value":"Jonathan Larson","user_id":32385,"display_name":"Jonathan Larson","author_link":"Jonathan Larson<\/a>","is_active":false,"last_first":"Larson, Jonathan","people_section":0,"alias":"jolarso"}],"msr_type":"Post","featured_image_thumbnail":"\"GraphRAG","byline":"Darren Edge<\/a>, Ha Trinh<\/a>, Steven Truitt<\/a>, and Jonathan Larson<\/a>","formattedDate":"July 2, 2024","formattedExcerpt":"GraphRAG, a graph-based approach to retrieval-augmented generation (RAG) that significantly improves question-answering over private or previously unseen datasets, is now available on GitHub.","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1051977","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/42735"}],"replies":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/comments?post=1051977"}],"version-history":[{"count":33,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1051977\/revisions"}],"predecessor-version":[{"id":1052817,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/posts\/1051977\/revisions\/1052817"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1052106"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/resear
ch\/wp-json\/wp\/v2\/media?parent=1051977"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/categories?post=1051977"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/tags?post=1051977"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1051977"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=1051977"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=1051977"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1051977"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1051977"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1051977"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=1051977"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=1051977"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}