{"id":1019925,"date":"2024-04-03T09:00:00","date_gmt":"2024-04-03T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/research-focus-week-of-april-1-2024\/"},"modified":"2024-07-18T07:57:16","modified_gmt":"2024-07-18T14:57:16","slug":"research-focus-week-of-april-1-2024","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/research-focus-week-of-april-1-2024\/","title":{"rendered":"Research Focus: Week of April 1, 2024"},"content":{"rendered":"\n

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

\"Research<\/figure>\n\n\n\n

NEW RESEARCH

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error

In the same way that tools can help people complete tasks beyond their innate abilities, tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a surprisingly understudied question is how accurately an LLM uses the tools for which it has been trained.

In a recent paper: LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error, researchers from Microsoft find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate of 30% to 60%, which is too unreliable for practical use. They propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms: trial and error, imagination, and memory. STE simulates plausible scenarios for using a tool, and the LLM then interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of exploration. Experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings.
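To make the three mechanisms concrete, here is a minimal sketch of how an STE-style loop could be wired together. It is an illustration only: the llm and tool callables and the prompts are assumed interfaces, not the paper's implementation.

```python
# Minimal sketch of a simulated trial and error (STE) loop.
# `llm` is assumed to be any callable mapping a prompt string to a completion,
# and `tool` any callable that executes a tool call and returns feedback text.
# Prompts and loop structure are illustrative, not the paper's actual code.

def simulated_trial_and_error(llm, tool, tool_doc, num_episodes=10, trials_per_episode=3):
    long_term_memory = []  # distilled experience reused across episodes

    for _ in range(num_episodes):
        short_term_memory = []  # trials made within the current episode

        # Imagination: the LLM invents a plausible user query for the tool,
        # conditioned on past experience so exploration stays broad.
        scenario = llm(
            f"Tool description:\n{tool_doc}\n"
            f"Previously explored scenarios:\n{long_term_memory}\n"
            "Imagine one new, plausible user query that requires this tool."
        )

        for _ in range(trials_per_episode):
            # Trial: the LLM proposes a concrete tool call for the scenario.
            call = llm(
                f"Tool description:\n{tool_doc}\nUser query: {scenario}\n"
                f"Earlier attempts and feedback: {short_term_memory}\n"
                "Produce the tool call (arguments only)."
            )
            # Error: execute the call and record the tool's feedback.
            feedback = tool(call)
            short_term_memory.append((scenario, call, feedback))

        # Memory: keep the episode's trials as examples that can later be used
        # for in-context learning or fine-tuning.
        long_term_memory.append(short_term_memory)

    return long_term_memory
```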

Read the paper

NEW RESEARCH

Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

The latest LLMs have surpassed the performance of older language models on several tasks and benchmarks, sometimes approaching or even exceeding human performance. Yet it is not always clear whether this is due to the increased capabilities of these models or to other effects, such as artifacts in datasets, test dataset contamination, and the lack of datasets that measure their true capabilities.

As a result, research to understand LLM capabilities and limitations has surged of late. However, much of this research has been confined to English, leaving LLM building and evaluation for non-English languages relatively unexplored. Several new LLMs have been introduced recently, necessitating their evaluation on non-English languages. In a recent paper: MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks, researchers from Microsoft aim to perform a thorough evaluation of the non-English capabilities of state-of-the-art LLMs (GPT-3.5-Turbo, GPT-4, PaLM2, Mistral, Gemini, Gemma, and Llama2) by comparing them on the same set of multilingual datasets. The benchmark comprises 22 datasets covering 81 languages, including several low-resource African languages. It also includes two multimodal datasets, used to compare the performance of LLaVA-v1.5 and GPT-4-Vision. Experiments show that GPT-4 and PaLM2 outperform the Llama and Mistral models on various tasks, notably on low-resource languages, with GPT-4 outperforming PaLM2 on more datasets. However, issues such as data contamination must be addressed to obtain an accurate assessment of LLM performance on non-English languages.
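At a high level, comparing many models on the same set of multilingual datasets amounts to running every model over every dataset and aggregating scores per language. The sketch below is a generic illustration of such a harness under assumed interfaces (each model is a callable mapping a prompt to an answer, each dataset yields (language, prompt, reference) triples), not the MEGAVERSE code.

```python
# Generic cross-model, multilingual evaluation harness (illustrative only).
from collections import defaultdict

def evaluate(models, datasets, score):
    # results[model_name][(dataset_name, language)] -> mean score
    results = defaultdict(dict)
    for model_name, model in models.items():
        for dataset_name, examples in datasets.items():
            per_language = defaultdict(list)
            for language, prompt, reference in examples:
                prediction = model(prompt)                 # query the LLM
                per_language[language].append(score(prediction, reference))
            # Aggregate per language so low-resource languages are visible
            # separately rather than averaged away.
            for language, scores in per_language.items():
                results[model_name][(dataset_name, language)] = sum(scores) / len(scores)
    return results
```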

Read the paper

NEW RESEARCH

Training Audio Captioning Models without Audio

Automated Audio Captioning (AAC) is the task of generating text descriptions for audio recordings. Unlike closed captioning, which transcribes speech, AAC aims to describe all the sounds in the audio (e.g., "a muffled rumble with people talking in the background while a siren blares in the distance"). Typical AAC systems require expensive curated datasets of audio-text pairs, which often results in a shortage of suitable data and impedes model training.

In this paper: Training Audio Captioning Models without Audio, researchers from Microsoft and Carnegie Mellon University propose a new paradigm for training AAC systems using text descriptions alone, thereby eliminating the requirement for paired audio and text. Their approach leverages CLAP, a contrastive learning model that uses audio and text encoders to create a shared vector representation between audio and text. For instance, the text "siren blaring" and its corresponding audio recording would share the same vector. The model is trained on text captions only: a GPT language decoder generates captions conditioned on the pretrained CLAP text encoder via a mapping network. During inference, the audio input is first converted to a vector with the pretrained CLAP audio encoder, and then a text caption is generated.
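The recipe can be sketched in a few lines of PyTorch-style code. This is an illustration of the idea under assumed interfaces (the CLAP encoders and the GPT decoder are treated as black-box pretrained callables, and the embedding dimensions are placeholders), not the authors' implementation.

```python
# Simplified sketch of text-only captioning with a CLAP prefix.
# clap_text_encoder, clap_audio_encoder, and gpt_decoder are assumed to be
# pretrained components; only the mapping network is trained here.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Projects a CLAP embedding into prefix tokens for the GPT decoder."""
    def __init__(self, clap_dim=1024, gpt_dim=768, prefix_len=10):  # placeholder dims
        super().__init__()
        self.prefix_len, self.gpt_dim = prefix_len, gpt_dim
        self.proj = nn.Linear(clap_dim, gpt_dim * prefix_len)

    def forward(self, clap_embedding):                 # (batch, clap_dim)
        prefix = self.proj(clap_embedding)             # (batch, gpt_dim * prefix_len)
        return prefix.view(-1, self.prefix_len, self.gpt_dim)

def training_step(captions, clap_text_encoder, mapper, gpt_decoder):
    # Training uses text only: encode the caption with the CLAP *text* encoder,
    # map it to a prefix, and train the decoder to reconstruct the caption.
    with torch.no_grad():
        text_embedding = clap_text_encoder(captions)   # lives in the shared audio-text space
    prefix = mapper(text_embedding)
    return gpt_decoder(prefix=prefix, labels=captions)  # language-modeling loss

def generate_caption(audio, clap_audio_encoder, mapper, gpt_decoder):
    # Inference swaps in the CLAP *audio* encoder: because CLAP places audio and
    # text in a shared space, the same mapper and decoder can caption audio.
    with torch.no_grad():
        audio_embedding = clap_audio_encoder(audio)
        prefix = mapper(audio_embedding)
        return gpt_decoder.generate(prefix=prefix)
```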

The researchers find that the proposed text-only framework competes well with top-tier models trained on both text and audio, showing that efficient text-to-audio transfer is possible. They also demonstrate the ability to incorporate various writing styles, such as humorous captions, which is useful for tailoring caption generation to specific domains. Finally, they highlight that enriching training with LLM-generated text improves performance and has potential for increasing vocabulary diversity. This research project puts Microsoft AI principles into practice. If the system is used in a manner that is abusive or illegal or infringes on your rights or the rights of other people, you can report it at the Report Abuse Portal.

Read the paper
