Research Focus: Week of March 18, 2024

Published

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

Research Focus March 20, 2024

Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning

Large language models (LLMs) have shown impressive capabilities, yet they still struggle with math reasoning. In a recent paper: Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning, researchers from Microsoft propose CoT-Influx, a novel approach that pushes the boundary of few-shot chain-of-Thought (CoT) learning to improve LLM mathematical reasoning.

Given that adding more concise CoT examples in the prompt can improve LLM reasoning performance, CoT-Influx employs a coarse-to-fine pruner to maximize the input of effective and concise CoT examples. The pruner first selects as many crucial CoT examples as possible and then prunes unimportant tokens to fit the context window. A math reasoning dataset with diverse difficulty levels and reasoning steps is used to train the pruner, along with a math-specialized reinforcement learning approach. As a result, by enabling more CoT examples with double the context window size in tokens, CoT-Influx significantly outperforms various prompting baselines across various LLMs (LLaMA2-7B, 13B, 70B) and 5 math datasets, achieving up to 4.55% absolute improvements. Remarkably, without any fine-tuning, LLaMA2-70B with CoT-Influx surpasses GPT-3.5 and a wide range of larger LLMs (PaLM, Minerva 540B, etc.) on the GSM8K. CoT-Influx serves as a plug-and-play module for LLMs and is compatible with most existing reasoning prompting techniques, such as self-consistency and self-verification.

Microsoft research podcast

What’s Your Story: Lex Story

Model maker and fabricator Lex Story helps bring research to life through prototyping. He discusses his take on failure; the encouragement and advice that has supported his pursuit of art and science; and the sabbatical that might inspire his next career move.

From User Surveys to Telemetry-Driven Agents: Exploring the Potential of Personalized Productivity Solutions

Organizations and individuals continuously strive to enhance their efficiency, improve time management, and optimize their work processes. Rapid advancements in AI, natural language processing, and machine learning technologies create new opportunities to develop tools that boost productivity. 

In a recent paper: From User Surveys to Telemetry-Driven Agents: Exploring the Potential of Personalized Productivity Solutions, researchers from Microsoft present a comprehensive, user-centric approach to understand preferences in AI-based productivity agents and develop personalized solutions. The research began with a survey of 363 participants, seeking to reveal users’ specific needs and preferences for productivity agents such as relevant productivity challenges of information workers, preferred communication style and approach towards solving problems, and privacy expectations. With the survey insights, the researchers then developed a GPT-4 powered personalized productivity agent that uses telemetry data gathered from information workers via Viva Insights to provide tailored assistance. The agent’s performance was compared with alternative productivity-assistive tools, such as the traditional dashboard and AI-enabled summaries, in a study involving 40 participants. The findings highlight the importance of user-centric design, adaptability, and the balance between personalization and privacy in AI-assisted productivity tools. The insights distilled from this study could support future research to further enhance productivity solutions, ultimately leading to optimized efficiency and user experiences for information workers.


LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

The size of the context window of a large language model (LLM) determines the amount of text that can be entered for processing to generate responses. The window size is specifically measured by anumber of tokens—larger windows are more desirable. However, due to high fine-tuning costs, scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. 

In a recent paper: LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens, researchers from Microsoft introduce a new method that extends the context window of pre-trained LLMs to an impressive 2.048 million tokens, without requiring direct fine-tuning on texts with extremely long lengths, which are scarce, while maintaining performance at the level of the original short context window. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of this method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding and can reuse most pre-existing optimizations.


Exploring Interaction Patterns for Debugging: Enhancing Conversational Capabilities of AI-assistants

Conversational interactions with large language models (LLMs) enable programmers to obtain natural language explanations for various software development tasks. However, LLMs often leap to action without sufficient context, giving rise to implicit assumptions and inaccurate responses. Conversations between developers and LLMs are primarily structured as question-answer pairs, where the developer is responsible for asking the right questions and sustaining conversations across multiple turns.  

In a recent paper: Exploring Interaction Patterns for Debugging: Enhancing Conversational Capabilities of AI-assistants, researchers from Microsoft draw inspiration from interaction patterns and conversation analysis to design Robin, an enhanced conversational AI-assistant for debugging. Robin works with the developer collaboratively, creating hypotheses about the bug’s root cause, testing them using IDE debugging features such as breakpoints and watches, and then proposing fixes. A user study with 12 industry professionals shows that equipping the LLM-driven debugging assistant to (1) leverage the insert expansion interaction pattern; (2) facilitate turn-taking; and (3) utilize debugging workflows, leads to lowered conversation barriers, effective fault localization, and 5x improvement in bug resolution rates.


Ironies of Generative AI: Understanding and mitigating productivity loss in human-AI interactions

Generative AI (GenAI) systems, which can produce new content based on input like code, images, speech, video, and more, offer opportunities to increase user productivity in many tasks, such as programming and writing. However, while they boost productivity in some studies, many others show that users are working ineffectively with GenAI systems and actually losing productivity. These ‘ironies of automation’ have been observed for over three decades in human factors research on automation in aviation, automated driving, and intelligence.  

In a recent paper: Ironies of Generative AI: Understanding and mitigating productivity loss in human-AI interactions, researchers from Microsoft draw on this extensive research alongside recent GenAI user studies to outline four key reasons for why productivity loss can occur with GenAI systems: 1) a shift in users’ roles from production to evaluation; 2) unhelpful restructuring of workflows; 3) interruptions; and 4) a tendency for automation to make easy tasks easier and hard tasks harder. We then suggest how human factors research can also inform GenAI system design to mitigate productivity loss by using approaches such as continuous feedback, system personalization, ecological interface design, task stabilization, and clear task allocation. Grounding developments in GenAI system usability in decades of research aims to ensure that the design of human-AI interactions in this rapidly moving field learns from history instead of repeating it. 

Related publications

Continue reading

See all blog posts