Microsoft Research Lab - Asia Articles http://approjects.co.za/?big=en-us/research/ Tue, 10 Feb 2026 04:19:06 +0000

Phi-Ground: Improving how AI agents navigate screen interfaces http://approjects.co.za/?big=en-us/research/articles/phi-ground-improving-how-ai-agents-navigate-screen-interface/ Mon, 19 Jan 2026 07:28:22 +0000

Imagine an AI assistant that can navigate a computer the same way humans do—clicking buttons, filling out forms, and moving between applications—all by simply interpreting what’s on the screen. This vision is becoming a reality through computer use agents—AI systems designed to operate software interfaces autonomously. Yet for these agents to function, they need to know exactly where to click and what to interact with on a screen.

This capability, called GUI grounding, enables computer use agents to locate specific elements on a graphical user interface (GUI). GUI grounding is the agent’s perception system that translates instructions like “click the Submit button” into exact screen coordinates.

Currently, grounding models succeed only 65% of the time, far from reliable enough for everyday use. A research team from Microsoft Research Asia conducted an extensive study to understand why, examining every stage of how these models are built and trained. Their work produced Phi-Ground (opens in new tab), a new model family that achieves state-of-the-art performance across all five grounding benchmarks among comparably-sized models.

The significance of GUI grounding

Computer use agents could transform how we interact with software in the digital world. These agents operate through graphical interfaces exactly as humans do, without requiring specialized APIs and allowing straightforward human oversight. This universality gives computer use agents broader potential than traditional robotic systems or specialized web automation tools.

GUI grounding serves as the agent’s interface with the digital world, the mechanism that determines whether the system succeeds or fails at its tasks. An agent that can’t reliably find what it’s looking for on screen is fundamentally limited, regardless of how well it can reason or plan.

Building better training data

Training more capable grounding models requires large-scale, high-quality data. The research team started with web pages from CommonCrawl, a massive public repository of internet content, and rendered them as screenshots to generate training examples. Yet web data contains substantial noise that can derail model training, from broken layouts and malformed pages to irrelevant content.

To address this, the team developed a multi-stage workflow to filter and refine the data, illustrated in Figure 1.

Figure 1. Workflow for refining the data obtained from CommonCrawl

Beyond CommonCrawl, the team incorporated open-source datasets, screenshots from web searches, and manually annotated examples for everyday scenarios, such as grounding in office software, where precision matters most. Together, these sources formed the training foundation for the Phi-Ground model family. Table 1 shows the final composition of the training data, including how many times each dataset was used during training (epoch) and its relative importance in the learning process (weight).

Table 1. The data used to train Phi-Ground

How to train grounding models

The team discovered that the order in which text and images are fed into the model significantly impacts performance. They tested two approaches: inputting text before images, and the reverse. The results in Table 2 show that text-first yields substantially better outcomes.

Table 2. Comparison of input order for text and images.

Why does this matter? Transformer models, the architecture underlying most modern AI systems, use causal processing, meaning earlier inputs cannot be updated using information from later ones. When images come first, the model processes visual information without receiving the user’s instructions (like “click the Submit button”). When text comes first, the model interprets visual information after receiving the instructions. It knows what to search for as it processes the image. For perception tasks like grounding, this instruction-aware process directly influences results. A simple change in input order produces this effect.
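This ordering effect can be illustrated with a toy causal attention mask (a sketch for intuition only, not Phi-Ground's actual implementation): each position may attend only to earlier positions, so image tokens placed after the instruction can condition on it, while image tokens placed before it cannot.

```python
import numpy as np

def causal_mask(n):
    # Position i may attend to positions <= i (lower-triangular mask).
    return np.tril(np.ones((n, n), dtype=bool))

# Toy sequence: 3 instruction tokens followed by 4 image tokens (text-first).
n_text, n_image = 3, 4
mask = causal_mask(n_text + n_image)

# Text-first: every image token can attend to all instruction tokens.
image_positions = range(n_text, n_text + n_image)
print(all(mask[i, :n_text].all() for i in image_positions))  # True

# Image-first: image tokens come before the text, so none of them
# can attend to any instruction token.
mask2 = causal_mask(n_image + n_text)
text_positions = range(n_image, n_image + n_text)
print(any(mask2[i, j] for i in range(n_image) for j in text_positions))  # False
```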

Researchers also examined computational costs during testing. These costs depend not only on model size but also on the number of image tokens—the units into which images are divided for processing. For perception tasks, more image tokens generally improve performance, but at what point do the benefits plateau?

To answer this, the team investigated the relationship among model size, image token count, and training data volume. Such experiments can guide developers balancing training efficiency with application speed, which are practical concerns beyond raw parameter counts.

Figure 2 illustrates training results for six models with different sizes and image detail levels. Across these evaluations, inference time generally tracks total computation, shown on the x-axis. Notably, many current studies report only parameter counts when discussing model performance, overlooking computational factors like image token count, a gap this research addresses directly.

In experiments where the model architecture was fixed, the team found that for demanding benchmarks like ScreenSpot-Pro and UI-Vision, image token count significantly impacts performance. When the count falls below a certain threshold, it becomes a bottleneck: the model cannot perceive small interface elements, reducing accuracy. Beyond approximately 2,000 image tokens, however, the benefits plateau, and additional tokens yield little further gain.
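As a rough back-of-the-envelope sketch, assuming a patch-based vision encoder with 28-pixel patches (an illustrative figure, not a Phi-Ground detail), a 1080p screenshot already crosses the roughly 2,000-token mark, while 720p falls well below it:

```python
# Rough sketch: image token count for a patch-based vision encoder.
# The 28 px patch size is an illustrative assumption, not a Phi-Ground detail.
def image_tokens(width, height, patch=28):
    # Tokens = number of non-overlapping patches covering the image.
    return (width // patch) * (height // patch)

for w, h in [(1280, 720), (1920, 1080), (2560, 1440), (3840, 2160)]:
    print(f"{w}x{h}: {image_tokens(w, h)} tokens")
```

Under this assumption, small UI elements in high-resolution screenshots simply vanish below the patch grid unless enough tokens are spent on the image.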

Figure 2. Relationship between training computation and accuracy

Evaluation results

Table 3 presents test results comparing several open-source models with fewer than 10 billion parameters across five GUI grounding benchmarks. The upper section shows results using the benchmark’s standard reference expressions, typically brief instructions or short phrases. The lower section shows results when researchers used OpenAI’s o4-mini model to generate longer, more detailed reference expressions, which were then tested by the grounding models.

Table 3. Comparison of results across five GUI grounding benchmarks

The Phi-Ground model is trained specifically for agent applications, with training data consisting primarily of various reference expressions. As a result, the model achieves state-of-the-art results across all benchmarks in the agent setting. On ScreenSpot-Pro, it reached 55% accuracy. On UI-Vision, it attained 36.2%, the highest score reported for this benchmark. The model’s performance on the Showdown benchmark surpassed that of commercial systems like OpenAI Operator and Claude Computer Use.

Enabling Copilot to understand onscreen content

The core technology of the Phi-Ground model has been integrated into the Vision Highlighting feature of Windows Copilot. As shown in Figure 3, Copilot can guide users step-by-step through visual tasks, such as helping them construct a bubble graphic.

Figure 3. A demo of Windows Copilot, integrated with the Phi-Ground model, helping users create dialogue bubbles in PowerPoint.

Beyond advancing GUI grounding, this research demonstrates how systematic study of training methods can unlock performance gains across multimodal AI systems. The integration into Windows Copilot marks an early step toward computer use agents that can genuinely assist with everyday digital tasks.

The post Phi-Ground: Improving how AI agents navigate screen interfaces appeared first on Microsoft Research.

Deep Video Discovery: Using agentic search to analyze long-form video http://approjects.co.za/?big=en-us/research/articles/deep-video-discovery-using-agentic-search-to-analyze-long-form-video/ Fri, 19 Dec 2025 08:01:03 +0000

Extracting useful information from long videos, whether meeting recordings, experimental data, or lecture content, requires painstaking manual review. AI tools offer some help: language-vision models can summarize short clips or answer questions when videos are divided into clear scenes or chapters. But for hours‑long recordings packed with information and lacking obvious structure, current models are limited. They process videos slowly, are unable to connect information across long stretches of content, and often provide limited or unhelpful answers.

To address these limitations, researchers at Microsoft Research Asia developed Deep Video Discovery (DVD), an agentic AI framework for long-video analysis. DVD divides long videos into shorter clips for individual analysis, then uses LLM-based reasoning to plan next steps and select appropriate tools. The agent retrieves needed information and uses it to answer complex questions about the video.

How DVD works

DVD operates through a simple cycle: observe the video content, analyze what it means, and choose the next action. Current video-analysis systems follow rigid, predesigned steps that have difficulty adapting to different tasks. In contrast, DVD adjusts its approach based on information it has gathered so far. To support this flexibility, the system operates in two stages:

Stage 1: Building a searchable video database

The system converts long videos into a structured database, dividing them into five-second clips and extracting information at three levels:

  • Global: Provides a topic-level summary of the video.
  • Clip‑level: Includes subtitles and brief text descriptions of each segment.
  • Frame‑level: Includes individual frames and visual details captured moment by moment.

Stage 2: Retrieving information and generating answers

The system uses three core tools to search the database:

  • Global browse: Provides high‑level context and video summaries.
  • Clip search: Retrieves clips that match a description and returns relevant results with subtitles and timestamps.
  • Frame inspect: Examines a specific moment in the video and extracts fine visual details; it can also answer questions about what appears in that frame.

Figure 1. DVD operates in two stages. First, it converts long videos into a searchable database organized by clips and frames at multiple scales. Then, it answers user queries through autonomous search and tool use.

The LLM serves as the system’s orchestrator, running repeated observe-reason-act cycles based on gathered information. This design gives the agent autonomy, ensures that its answers stay grounded in actual video content, and allows the system to break complex questions into smaller, more manageable sub-questions.
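A minimal version of this observe-reason-act cycle might look like the following, with stub tools standing in for DVD's global browse, clip search, and frame inspect. In the real system an LLM chooses each next action; here the policy is hard-coded for illustration:

```python
# Stub tools over a toy database; a real agent would call an LLM to
# decide which tool to invoke at each step.

def global_browse(db):
    return db["summary"]

def clip_search(db, query):
    return [c for c in db["clips"] if query in c["text"]]

def frame_inspect(db, timestamp):
    return db["frames"].get(timestamp, "no detail")

def answer(db, question, max_steps=5):
    notes = [global_browse(db)]              # observe: start from the summary
    hits = clip_search(db, question)         # act: narrow down to matching clips
    for c in hits[:max_steps]:
        notes.append(frame_inspect(db, c["start"]))  # drill into frame details
    return " | ".join(notes)

db = {
    "summary": "cooking tutorial",
    "clips": [{"start": 10, "text": "crack the egg"}],
    "frames": {10: "egg in a bowl"},
}
print(answer(db, "egg"))  # cooking tutorial | egg in a bowl
```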

DVD achieves state-of-the-art performance across benchmarks

DVD achieved state-of-the-art performance across multiple long‑video benchmarks (Table 1). On the challenging LVBench dataset, DVD reached 74.2% accuracy, outperforming all existing methods, a 13.4‑point gain over the previous best method, MR. Video. When transcript data was available, accuracy rose to 76.0%.

Table 1. DVD outperforms prior work by a large margin on LVBench.

DVD also exceeded previous state-of-the-art performance on three other long‑video benchmarks: LongVideoBench, Video MME Long, and EgoSchema, surpassing human‑level accuracy (approximately 76%) on EgoSchema.

The choice of reasoning model critically affects DVD’s performance (Figure 2). Replacing the reasoning model with OpenAI o4‑mini or GPT‑4o causes sharp performance drops, indicating that limited reasoning capability breaks down the agent’s process. Different models also show distinct patterns in how they use tools, how deeply they analyze videos, and how accurately they respond. For example, GPT‑4o often exhibits “overconfidence,” stopping its analysis prematurely. These observations offer practical guidance for designing future agents and developing foundational LLMs.

Figure 2. Analysis of how different foundation models behave inside the agent. The results show clear differences across models.

Toward more comprehensive video understanding

As video content becomes richer and more complex, enabling AI to interpret and reason about what it captures, not just identify individual elements, is a central challenge in video comprehension. DVD offers one path forward through an agentic approach that is interpretable, plannable, and collaborative.

Looking forward, researchers at Microsoft Research Asia are working to develop agents with stronger contextual awareness, and more advanced reasoning capabilities, advancing toward AI systems that can handle complex videos with greater depth, precision, and automation.

The post Deep Video Discovery: Using agentic search to analyze long-form video appeared first on Microsoft Research.

Where AI meets neuroscience: Yansen Wang’s pursuit of human-centered innovation http://approjects.co.za/?big=en-us/research/articles/where-ai-meets-neuroscience-yansen-wangs-pursuit-of-human-centered-innovation/ Thu, 11 Dec 2025 10:27:25 +0000

“Curiosity drives scientific breakthroughs, and the tools we create often reflect the human motivations behind that curiosity.”

For Yansen Wang, a senior researcher at Microsoft Research Asia, this philosophy has guided his work at the intersection of AI and neuroscience.

Wang’s interest in science began early. While his classmates spent hours searching for information to complete an assignment, he solved it in 10 minutes by writing a few lines of code. From then on, programming was his way of tackling complex challenges, and the satisfaction of creating practical solutions fueled his passion for computer science. As his studies advanced, his goals crystallized: he wanted to create technology that genuinely serves people.

Yansen Wang, senior researcher at Microsoft Research Asia

This people-centered approach has shaped Wang’s career. As an undergraduate at Tsinghua University, he witnessed AI defeat the world’s Go champion—a moment that sparked a realization: AI could help us understand how humans think. This insight led him to pursue research in multimodal AI for his master’s studies before joining Microsoft Research Asia – Shanghai. Today, in the AI/ML Group, Wang works to bridge AI and neuroscience.

A two-way journey between AI and neuroscience

Wang’s research focuses on two complementary directions: advancing neuroscience by decoding how the brain works (AI for Brain), and drawing inspiration from the brain to improve AI architectures and algorithms (Brain for AI).

For the AI for Brain project, Wang and his colleagues use noninvasive electroencephalography (EEG), which records the brain’s electrical signals without surgery, to build brain-computer interfaces as their research platform. “Our understanding of the brain is still very limited,” he explains. “Its multitasking abilities and rapid adaptability are far beyond what AI can achieve, so we’re using AI to analyze EEG signals and uncover the links between perception, intention, and brain activity.”

The team has already made substantial progress. In terms of perception, they have decoded broad visual features, such as colors and simple moving scenes. Using diffusion models, they transformed neural signals into matching visual content and developed EEG2Video, a baseline model that reconstructs video clips from EEG signals. To improve generalization across different contexts, the team has built multiple datasets linking EEG signals to everyday behaviors. See the video below for an example of this work.

EEG2Video demo: The original video input is shown at the top and the reconstructed video at the bottom.

For command-based control, Wang and his team tackled the challenge of decoding letters from brain signals. They introduced an innovative codebook approach: instead of having participants imagine letter shapes, which devices find difficult to recognize, they guided participants to imagine body movements, mathematical calculations, and other semantic content that devices can recognize more easily. AI then mapped these signals to specific letters. With portable devices, this method has achieved 30%–40% accuracy across 36 options (26 letters and 10 digits).

“We’re now working to extend this approach to controlling mobile phones and interacting on the web, exploring new interaction modes beyond letter input,” he says.

Still, the approach of using noninvasive brain-computer interfaces comes with many challenges. EEG signals have low signal-to-noise ratios and are easily disrupted by environmental and physiological interference, such as eye movements or muscle activity, making reliable readings difficult on portable devices. EEG data is scarce and must be collected in a controlled setting. And individual differences mean that systems often don’t generalize well across users and tasks.

“We’re advancing research on EEG foundation models and hope to make them more robust with more data and larger models, much like large language models,” Wang explains.

Learning from the brain to make AI more efficient

For the Brain for AI project, Wang and his colleagues are exploring how brain function can address AI’s energy demands.

“The brain can accomplish tremendous thinking and computation with just the energy supplied from a bowl of rice, while AI requires vast resources and electricity to achieve similar results,” he observes. “Even more remarkable is that the brain efficiently handles many tasks without complex networks, like fine motor control. People only need a few examples to learn new tasks, but current AI models often need massive amounts of data to relearn. There must be design principles at play here that AI can learn from.”

The key differences lie in how neurons are structured and operate. Neurons use a spiking mechanism, firing and transmitting signals only when they reach their activation threshold. This results in extremely low energy consumption when they are at rest. Artificial neural networks, by contrast, perform large-scale computations even when there is very little information to process, using far more energy than the brain requires for similar tasks.

Using this insight, Wang and his team developed a more efficient spiking neural network (SNN) framework. In time-series prediction tasks, SNNs now perform comparably to traditional neural networks but can theoretically reduce energy consumption to a quarter of the latter, offering a new path for low-power AI (Figure 1).
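The spiking mechanism described above can be sketched with a toy leaky integrate-and-fire neuron; the parameters are illustrative and not drawn from the team's SNN framework:

```python
# Toy leaky integrate-and-fire (LIF) neuron: the membrane potential decays
# over time, and the neuron fires only when input accumulates past a
# threshold, staying silent (and energy-free) otherwise.

def lif(inputs, threshold=1.0, leak=0.9):
    v, spikes = 0.0, []
    for x in inputs:
        v = leak * v + x          # integrate input with leaky decay
        if v >= threshold:
            spikes.append(1)      # fire on reaching the threshold
            v = 0.0               # reset membrane potential after a spike
        else:
            spikes.append(0)
    return spikes

# Mostly-quiet input produces mostly-quiet output: only two spikes here.
print(lif([0.1, 0.1, 0.9, 0.1, 1.2, 0.0]))
```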

Figure 1. The SNN framework and workflow for time-series prediction

“The spiking neural network research is just one part of our work,” says Wang. “Neurons in the brain are sparsely connected—each one typically links to only a few nearby neurons, while in artificial neural networks, a single neuron connects to thousands of others. The brain’s sparse connectivity also helps reduce energy consumption. If we continue learning from how the brain operates, AI will be able to generalize better and become more energy-efficient.”

Yansen Wang gives a lecture at the TEDxBeijing Innovation Conference

Breaking boundaries: From outsider to domain expert

For Wang, cross-disciplinary research in AI and neuroscience was new territory. He had no formal training in neuroscience or EEG analysis, but through dedicated study, active collaboration, and strong team support, he developed deep expertise.

When he began EEG research, Wang studied medical textbooks and sought guidance from collaborating physicians. “I set a rule: temporarily set aside my AI perspective and approach this as a doctor would. I studied medical textbooks thoroughly and learned to read EEG signals. Only then did I consider how AI could help.” This approach helped him understand clinical problems rather than imposing familiar frameworks.

Cross-disciplinary collaboration is not just about combining knowledge; it’s about how different perspectives collide. For example, in research on epilepsy detection, Wang discovered that AI researchers and physicians approach problems differently. AI researchers often assume that with enough data, models can learn to identify features of epileptic seizures, such as spikes and slow waves. But physicians, drawing on experience, can quickly spot rare abnormalities in massive amounts of hard-to-interpret EEG signals. Models can miss these abnormalities, even when trained on vast amounts of data.

“This showed me that machine learning progress cannot rely solely on brute force. In data-scarce fields like medicine, we must incorporate domain expertise and build in the right assumptions to improve model performance.”

To help researchers build cross-domain knowledge, Microsoft Research Asia – Shanghai established a neuroscience study group, with weekly classes, homework, and discussions. After six months, Wang had learned the fundamentals of neuroscience and gained practical guidance from senior researchers. “This collective learning atmosphere means we’re no longer working in isolation but instead growing together as a community,” he says.

Microsoft Research Asia encourages open exploration and open exchange. At the Shanghai lab’s weekly “Grand Challenge” meetings, researchers rigorously challenge one another’s work. “At first, I wasn’t used to this style of questioning,” Wang admits. “But I realized that these challenges expose blind spots and allow research to improve through iterative refinement. The toughest questions often lead to the most important breakthroughs.”

Yansen Wang (third from right) discusses research questions with colleagues.

Research with purpose: Building AI that serves people

For Wang, technology should serve people. Whether developing brain-computer interfaces or creating explainable AI for Go, the focus of the work should be on making AI useful and accessible.

In 2022, Wang and his colleagues launched what they called a “human salvation project” for Go players. AI had surpassed top players, causing anxiety among professionals. Players could imitate AI moves but couldn’t understand the reasoning behind them; they memorized patterns without developing their own strategic thinking. “I thought, ‘If AI could explain its logic, players could truly understand the strategies behind the moves,’” Wang says. “We wanted to help people improve alongside AI.” Wang and the team are actively collaborating with Go enthusiasts and professional players to verify the feasibility of the explanations. “That is so impressive!” says a teacher at a Go learning institute. “I see the future of teaching humans how to play Go.”

For Wang, this captures what drives his research: not the number of papers, but tangible impact. Perhaps it’s the moment when a player grasps a brilliant move, or when someone finds more convenient ways to interact with devices, or when researchers apply new approaches for energy-efficient AI.

“At Microsoft Research Asia, I can follow my interests and work with partners to solve meaningful problems for humanity,” he says.

The post Where AI meets neuroscience: Yansen Wang’s pursuit of human-centered innovation appeared first on Microsoft Research.

UI-E2I-Synth: Realistic and challenging UI grounding benchmark for computer-use agents http://approjects.co.za/?big=en-us/research/articles/ui-e2i-synth-realistic-and-challenging-ui-grounding-benchmark-for-computer-use-agents/ Mon, 24 Nov 2025 04:15:07 +0000

AI assistants, designed to perform actions on behalf of users, may not be as capable as current benchmarks suggest. New research reveals that existing tests for UI grounding—the ability of assistants to locate elements in the graphical user interface (GUI)—have been overestimating the performance of visual language models (VLMs), which power these assistants.

This becomes clear when comparing test conditions to real-world use. Current benchmarks use unrealistically large GUI elements—on a typical monitor, buttons and icons occupy a much smaller fraction of the screen—and test only a limited subset of element types like checkboxes and Submit buttons. Moreover, these benchmarks rely on simple, explicit instructions like “click the Save button” while neglecting the implicit language people actually use, like “click where I can change my password.”

To address this gap, a research team at Microsoft Research Asia has developed UI-E2I-Synth (opens in new tab), a large-scale data synthesis method that generates more realistic data, and UI-I2E-Bench (opens in new tab), a benchmark that better reflects actual computer use. The related paper has been accepted at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025).

Specifically, UI-I2E-Bench reflects typical 1080p and 1440p displays, labels whether each instruction is explicit or implicit, and includes a broader, more balanced range of UI elements compared with leading benchmarks, as shown in Figure 1.

Figure 1: Left: ScreenSpot, a widely used benchmark for testing GUI grounding in multimodal models, shows interface elements that are disproportionally large compared to real-world desktops. Right: The ScreenSpot dataset mainly consists of text and icon elements, offering limited variety in other types of GUI components.

How the system generates training data

UI-E2I-Synth uses large language models (LLMs) to automatically generate realistic user instructions, reducing the manual effort required to label screenshots. The system works in three stages, each building on the previous one.

Gathering and organizing interface data. First, it collects UI screenshots and accompanying metadata from various platforms, including web pages, Windows applications, and Android interfaces. An automated tool then identifies and catalogs each UI element, recording details like whether it’s a button or text field, what it displays, and where it appears on screen. This step produces an organized catalog of UI elements that serves as the foundation for generating instructions.

Generating natural descriptions. Next, OpenAI’s GPT-4o analyzes these cataloged elements to create different ways users might realistically describe them, both explicit descriptions (e.g., “the blue Submit button in the top-right corner”) and implicit ones (e.g., “the confirmation button” or “the button next to the username”). This variety captures the range of ways users might refer to the same interface element.

Creating complete instructions. Finally, GPT-4o pairs these descriptions with specific actions to create complete, natural-sounding user instructions that reflect how people actually interact with interfaces, for example, “Click Send” or “Enter my password.” The result is a diverse set of instructions that more accurately reflects user behavior.

This process is illustrated in Figure 2.

Figure 2: UI-E2I-Synth’s three-stage process for generating realistic user instructions.
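The element catalog produced in the first stage might be modeled like this (field names are hypothetical, chosen only to mirror the description above); the second and third stages would feed such records to GPT-4o for description and instruction generation:

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    kind: str        # e.g. "button", "text_field", "checkbox"
    text: str        # visible label, if any
    bbox: tuple      # (x, y, width, height) in pixels

def element_to_prompt(el):
    # Serialize one catalog entry into prompt context for stage two.
    x, y, w, h = el.bbox
    return f"{el.kind} '{el.text}' at ({x},{y}), size {w}x{h}"

el = UIElement(kind="button", text="Submit", bbox=(1820, 40, 80, 28))
print(element_to_prompt(el))  # button 'Submit' at (1820,40), size 80x28
```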

Training and testing more realistic models

From UI-E2I-Synth’s synthesized data, the team created UI-I2E-Bench, a new benchmark that reflects real-world conditions. It includes labels identifying the type of element and whether instructions are explicit or implicit, along with a realistic element-to-screen ratio—providing a rigorous test of vision-language models’ (VLMs) GUI grounding capabilities.

To evaluate the effectiveness of the proposed data synthesis pipeline, the team used almost 10 million individual instructions generated by UI-E2I-Synth to train two VLMs: UI-I2E-VLM-4B and UI-I2E-VLM-7B. UI-I2E-VLM-7B performed well across multiple benchmarks, including UI-I2E-Bench, using only 72% of the training data required by comparable state-of-the-art models.

The models performed especially well at handling indirect instructions and locating small elements, and more easily recognized challenging element types like icons and text entry fields. The results also confirmed that existing benchmarks overestimate model capabilities due to their unrealistically simple test conditions. The details of these results are shown in Table 1.

Table 1. Performance analysis by category of UI-I2E-VLM models on GUI grounding benchmarks with accuracy values given in percent. UI-I2E-VLM-7B achieved the best performance on the majority of tasks.

Diagnosing model strengths and weaknesses

The detailed labels of UI-I2E-Bench enabled the research team to analyze where models succeed and fail. The analysis revealed several key patterns.

Instruction complexity. Models showed the most improvement in handling implicit instructions. As shown in Table 2, leading models struggled with these realistic instructions, lagging by 12 percentage points compared with their accuracy on explicit instructions. Interestingly, systems powered by GPT-4o performed well on implicit instructions but struggled with explicit ones, primarily due to difficulty in locating small elements and uncommon interface components.

Element size matters. The smaller the interface element, the more accuracy dropped across all models. This confirms that small elements and high-resolution images are critical factors in model testing. Models trained with UI-E2I-Synth, which uses more training data and processes images with higher detail, performed better in locating these small elements.

Underrepresented element types. Existing models showed clear shortcomings with less common interface elements like icons and text entry fields. By balancing the distribution of element types in training data, UI-E2I-Synth directly addresses this gap and improves model performance.

Table 2. Detailed performance analysis by category on UI-I2E-Bench, with accuracy values given in percent. UI-I2E-VLM-7B achieved the best performance on all tasks.

Raising the bar for UI grounding

UI-E2I-Synth and UI-I2E-Bench address fundamental gaps in how GUI grounding models are trained and evaluated. Rather than relying on oversimplified benchmarks, this approach prepares models for the messy reality of actual computer interfaces—where elements are small, diverse, and instructions are often ambiguous.

The research establishes more rigorous standards for the field and could pave the way for AI assistants that can reliably navigate real-world software, moving these tools closer to practical deployment.

The post UI-E2I-Synth: Realistic and challenging UI grounding benchmark for computer-use agents appeared first on Microsoft Research.

]]>
UI-Evol: Computer-use Agents Act on Knowledge http://approjects.co.za/?big=en-us/research/articles/ui-evol-compute-use-agents-act-on-knowledge/ Mon, 17 Nov 2025 04:21:04 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1155826 Computer-use agents are AI systems that autonomously navigate and interact with software applications through graphical user interfaces (GUIs), and they are emerging as a new capability in artificial intelligence. By navigating and manipulating the same visual interfaces that people use, they can perform complex tasks on behalf of users, from filling out forms to managing […]

The post UI-Evol: Computer-use Agents Act on Knowledge appeared first on Microsoft Research.

]]>
Computer-use agents are AI systems that autonomously navigate and interact with software applications through graphical user interfaces (GUIs), and they are emerging as a new capability in artificial intelligence. By navigating and manipulating the same visual interfaces that people use, they can perform complex tasks on behalf of users, from filling out forms to managing workflows.

Yet despite their promise, these agents perform poorly in practice. They typically draw on external knowledge—information retrieved from the web that describes how to navigate the interfaces in question—and use it to interpret what’s on the screen and adapt to different environments. However, these agents often fail to translate this knowledge into successful action—a problem researchers call the “knowledge–action gap.”

A recent study shows that even when the instructions are 90% correct, agents perform tasks successfully only 41% of the time. This disconnect between having the needed information and effectively applying it, illustrated at the top of Figure 1, can lead to a frustrating user experience.

To address this, researchers at Microsoft Research Asia developed UI-Evol, a ready-to-use component that integrates into an agent’s workflow and relies on the actual user interface for guidance. UI-Evol continuously updates its interface knowledge, helping make agents more accurate and reliable when completing tasks, as shown in the bottom of Figure 1.

Figure 1: The top shows how correct external knowledge still fails to work in real-world settings. The bottom shows how UI-Evol narrows this gap by aligning knowledge with the software environment, enabling more reliable performance.

This work has been recognized by the research community, with the team’s findings accepted at the ICML 2025 Workshop on Computer Use Agents (opens in new tab).

How UI-Evol works

UI-Evol addresses the knowledge-action gap through a two-stage process. The first stage, called retrace, records the exact steps an agent takes to finish a task. In this way, the system captures the specific clicks, keystrokes, and other actions that led to the result.

The second stage, critique, reviews those actions against instructions drawn from outside the application. If it finds mismatches, it adjusts the knowledge so that the steps reflect what actually works in practice. Together, these two stages turn external instructions into tested, reliable guidance for agents. This process is illustrated in Figure 2.

Figure 2: UI-Evol’s two stages refine outside instructions with the agent’s real actions, producing guidance that works in practice.
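
The retrace/critique loop described above can be sketched in a few lines. This is an illustrative simplification, not the actual UI-Evol implementation; the `retrace` and `critique` functions and the dict-based action log are hypothetical:

```python
# Illustrative sketch of UI-Evol's two stages (hypothetical API, not the real code).

def retrace(action_log):
    """Stage 1 (retrace): record the exact steps the agent actually took."""
    return [f"{a['type']}({a['target']})" for a in action_log]

def critique(external_steps, observed_steps):
    """Stage 2 (critique): review external instructions against observed
    actions, keeping steps that match and replacing ones that do not."""
    revised = []
    for ext, obs in zip(external_steps, observed_steps):
        revised.append(ext if ext == obs else obs)  # trust what actually worked
    revised.extend(observed_steps[len(external_steps):])  # keep extra real steps
    return revised

log = [{"type": "click", "target": "File"},
       {"type": "click", "target": "Export as PDF"}]
knowledge = ["click(File)", "click(Save as PDF)"]  # slightly outdated web guidance
print(critique(knowledge, retrace(log)))
# ['click(File)', 'click(Export as PDF)']
```

Over repeated tasks, a loop like this replaces untested web-sourced steps with steps the agent has verified in its own environment.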

Assessing UI-Evol’s effect on performance, reliability

The research team tested UI-Evol on Agent S2, a state-of-the-art computer-use agent. They used the OSWorld benchmark, designed to evaluate multimodal agents on open-ended computer tasks involving real software and workflows. They found that UI-Evol not only improved performance but also made the agent’s behavior more dependable.

Computer-use agents have long shown what researchers call “high behavioral standard deviation.” In plain terms, the same agent, given the same task, may act differently each time it tries to carry it out. This unpredictability has not been a central focus of earlier work, yet it is precisely what limits agents’ usefulness in real-world applications.

With UI-Evol, that pattern shifted. Experiments with agents based on leading LLMs, such as GPT-4o and OpenAI-o3, showed not only higher success rates (Table 1) but also greater consistency.

Table 1: Experiment results on OSWorld. “SR” denotes success rate. It shows that computer-use agents often behaved unpredictably. With UI-Evol, performance improved, and their behavior became more consistent.

What this means for practical AI

The introduction of UI-Evol tackles a problem that has challenged computer-use agents since their inception: the gap between what they know and what they can reliably do. As these agents move from research labs to real-world settings such as office automation, virtual assistants, and robotic process automation, consistency matters as much as capability.

UI-Evol’s approach—learning from actual agent behavior rather than relying on external knowledge alone—offers a path forward. It’s not only about making agents smarter; it’s about making them dependable enough to trust with real work.

The post UI-Evol: Computer-use Agents Act on Knowledge appeared first on Microsoft Research.

]]>
DocReward: Advancing professional document design through AI evaluation http://approjects.co.za/?big=en-us/research/articles/docreward-advancing-professional-document-design-through-ai-evaluation/ Thu, 13 Nov 2025 04:00:14 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1155599 In recent years, as the shift toward agentic AI has accelerated, automation has advanced to handle increasingly complex tasks, from document and code generation to image creation, visual understanding, and mathematical reasoning. This trend points to the growing need to transform traditional software into intelligent agents. When core productivity platforms like Microsoft Office evolve into […]

The post DocReward: Advancing professional document design through AI evaluation appeared first on Microsoft Research.

]]>
In recent years, as the shift toward agentic AI has accelerated, automation has advanced to handle increasingly complex tasks, from document and code generation to image creation, visual understanding, and mathematical reasoning. This trend points to the growing need to transform traditional software into intelligent agents. When core productivity platforms like Microsoft Office evolve into next-generation agents with autonomous reasoning and operational abilities, they can connect natural language and office automation in new ways, making work more efficient and precise.

One of the key challenges in this transformation lies in generating documents that are not only accurate in content but also well-structured and visually coherent. While most research has focused on improving text quality, the structural and stylistic dimensions of professional documents—layout, hierarchy, and readability—remain underexplored.

To address this gap, Microsoft Research Asia, in collaboration with The Chinese University of Hong Kong and the University of Chinese Academy of Sciences, has developed DocReward, a reward model that evaluates the structural and stylistic quality of AI-generated documents. By guiding agents to produce outputs that are clear, organized, and well-presented, DocReward provides crucial support for automated document creation.

Deep Research is an agent that gathers, analyzes, and synthesizes information from multiple sources into coherent, well-structured documents. Combined with DocReward, it completes a full workflow—from research and information integration to polished document presentation—laying the groundwork for transforming traditional office software into agent-driven systems.

Figure 1. DocReward automatically evaluates a document’s quality based on structure and style, supporting agentic workflows in generating polished documents.

Task modeling: Evaluating document structure and style

DocReward assigns quality scores to documents based on their layout and visual characteristics. For example, consider a set of documents {D_i}, where each document includes both its text content D_text,i and a corresponding rendered image D_img,i. The reward model assigns scores to these documents to reflect their quality in terms of structure and stylistic presentation.

For a group of documents with identical content, the goal is for the reward model (Rθ) to predict scores in an order consistent with the true ranking of their structural and stylistic quality (π*). In doing so, the model learns to distinguish differences in layout and design even when the text remains the same, improving the accuracy of its evaluations.

Mathematically, this is expressed as the following:

argsort_i R_θ(D_text,i, D_img,i) = π*

That is, sorting the documents by their predicted scores reproduces the true ranking π* of their structural and stylistic quality.
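
For intuition, the ordering objective can be checked numerically. The scores below are invented for illustration, and the `induced_ranking` helper is not part of DocReward:

```python
def induced_ranking(scores):
    """Indices of documents sorted from highest to lowest predicted score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

scores = [1.21, 2.11, 5.34]   # hypothetical reward-model outputs for three renderings
true_ranking = [2, 1, 0]      # ground-truth ranking pi*, best document first
print(induced_ranking(scores) == true_ranking)  # True
```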

Definition of document structure and style professionalism:

  • Structure: Proper use of whitespace and margins; clear section separation; consistent text alignment; paragraph spacing and indentation; standardized headers and footers; and overall logical organization.
  • Style: Appropriate fonts (type, size, and color); readable heading hierarchy; effective use of bold and italics for emphasis; clear bullet and numbering formats; and consistent formatting throughout the document.

Constructing DocPair, the foundation for DocReward

To train DocReward, the research team built the DocPair dataset, which contains 117,000 document pairs spanning 32 domains and 267 document types. This diverse dataset enables the model to be optimized through preference learning to accurately assess structural and stylistic quality across a wide range of documents.

As shown in Figure 2, constructing the DocPair dataset involves three steps:

1. Curating high-quality professional documents

The team began by collecting a broad set of Microsoft Word files, ranging from formal institutional documents to routine business correspondence. Data sources include:

  • Government and institutional documents, which make up the GovDocs1 and NapierOne datasets. GovDocs1 contains a wide range of U.S. government materials, including policy reports, administrative forms, statistical reports, meeting minutes, and more. NapierOne features office documents from public institutions, all characterized by strong structural and stylistic standards.
  • Web documents, which consist of professionally authored files from the CommonCrawl database, spanning business, education, the nonprofit sector, and medicine. These include proposals, syllabi, newsletters, technical manuals, and policy briefs, contributing to broad diversity in document formats and presentation styles.

To ensure data quality, all documents were converted to .docx format and filtered to remove abnormal or incorrectly formatted files. The large language model (LLM) GPT-5 was then used to automatically score structure and style on a 0–10 scale, retaining only those scoring above 8.
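
The retention step amounts to a simple threshold over the LLM-assigned scores. The document names and scores below are invented for illustration:

```python
def keep_professional(docs, threshold=8):
    """Keep only documents whose LLM-assigned structure/style score
    (0-10 scale) exceeds the threshold."""
    return [d for d in docs if d["score"] > threshold]

docs = [{"name": "policy_report.docx", "score": 9.2},
        {"name": "rough_memo.docx", "score": 6.5}]
print([d["name"] for d in keep_professional(docs)])  # ['policy_report.docx']
```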

The resulting dataset spans 32 domains and 267 document types and serves as the basis for subsequent document-pair construction. Figures 2 and 3 show the distribution of the top 10 domains and top 30 document types.

Figure 2. Top 10 document domains
Figure 3. Top 30 document types

2. Expanding source documents via agents

To create document sets with identical text but varying structure and style, the team designed two types of document-generation agents:

  • Text-to-document generation agent: Extracts plain text from source documents, removes all structural and stylistic information, and then uses advanced LLMs (GPT-4o, Claude Sonnet 4, GPT-5, etc.) to generate .docx documents through python-docx code.
  • Structure and style optimization agent: Further refines the synthetic documents by referencing original human-written examples. This process involves two stages—first generating an optimization plan, then modifying .docx files via python-docx to improve structure and style.

3. Document ranking and annotation

Within each document group, all samples share the same text content. The team constructed two types of comparison pairs:

  • Human vs. synthetic documents: When a pair includes a real human-written document, that version is labeled as more professional.
  • Synthetic vs. synthetic documents: When both documents are synthetic, a human-written reference document is used as a guide, and GPT-5 annotates which synthetic version exhibits higher structural and stylistic quality.
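
These two pairing rules can be sketched in a few lines. The field names (`id`, `source`, `quality`) are illustrative, not the dataset’s actual schema:

```python
def build_pairs(group):
    """Build (winner, loser) preference pairs within one same-text group.
    Each doc is a dict with 'source' ('human' or 'synthetic') and, for
    synthetic docs, an annotator-assigned 'quality' score."""
    pairs = []
    humans = [d for d in group if d["source"] == "human"]
    synthetics = [d for d in group if d["source"] == "synthetic"]
    # Rule 1: human-written documents are labeled more professional than synthetic ones.
    for h in humans:
        for s in synthetics:
            pairs.append((h["id"], s["id"]))
    # Rule 2: between synthetics, the annotator's quality judgment decides.
    for i, a in enumerate(synthetics):
        for b in synthetics[i + 1:]:
            if a["quality"] != b["quality"]:
                w, l = (a, b) if a["quality"] > b["quality"] else (b, a)
                pairs.append((w["id"], l["id"]))
    return pairs

group = [{"id": "h1", "source": "human"},
         {"id": "s1", "source": "synthetic", "quality": 7},
         {"id": "s2", "source": "synthetic", "quality": 4}]
print(build_pairs(group))  # [('h1', 's1'), ('h1', 's2'), ('s1', 's2')]
```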

The final DocPair dataset provides a solid foundation for training DocReward. Multi-page visual renderings of a document are fed to the vision encoder, and a regression head is added to the language model. A special token is placed at the end of each image sequence, and its hidden state, processed by the regression head, predicts the document’s overall score.

Figure 4 illustrates the overall DocPair data construction process, summarizing the three main stages described above.

Figure 4. DocPair data construction process

Training and evaluation

Training

DocReward is trained using the Bradley-Terry (BT) loss to learn from paired document preferences. Each document’s pages are input into the model, which outputs a score representing its structural and stylistic quality. The BT loss encourages DocReward to assign higher scores to preferred documents, helping it reliably distinguish differences in structure and style.

Mathematically, this is expressed as the following:

L(θ) = -E_(D_w, D_l) [ log σ( R_θ(D_w) - R_θ(D_l) ) ]

where D_w and D_l are the preferred and rejected documents in a pair, and σ is the sigmoid function.
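
A minimal numeric sketch of this loss in plain Python (`s_w` and `s_l` stand for the scores of the preferred and rejected documents; no training framework is assumed):

```python
import math

def bt_loss(s_w, s_l):
    """Bradley-Terry preference loss: -log sigmoid(s_w - s_l).
    It shrinks as the preferred document's score margin grows."""
    return -math.log(1.0 / (1.0 + math.exp(-(s_w - s_l))))

# With no margin the loss is log(2); a reversed preference is penalized heavily.
assert abs(bt_loss(0.0, 0.0) - math.log(2)) < 1e-9
assert bt_loss(5.34, 2.11) < bt_loss(2.11, 5.34)
```

Minimizing this quantity pushes the model to score the preferred document higher, which is exactly the behavior described above.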

Experiments and evaluation

The research team conducted a series of experiments to test DocReward’s effectiveness in evaluating document structural and stylistic quality.

Experiment 1: Preference accuracy evaluation

Researchers randomly sampled high-quality documents to build an evaluation dataset that included both human-written and synthetic documents generated by various LLMs, ensuring diversity in structure and style.

For each group of documents with identical text but differing structure and style, experienced Word users familiar with document design ranked them by quality. These rankings were then converted into 473 document-pair comparisons, with each pair annotated to indicate which document was superior.

As shown in Table 1, DocReward achieved significant improvements over strong baselines, including GPT-4o, Claude Sonnet 4, and GPT-5.

Table 1. Accuracy of different reward models in predicting human preferences for document structure and style.

DocReward-7B (with 7 billion parameters) achieved an overall human-preference accuracy of 89.22%, outperforming the best proprietary baseline, GPT-5 (69.77%), by 19.45 percentage points. Even in the more challenging synthetic-vs.-synthetic setting, DocReward-7B maintained 78.22% accuracy, compared with GPT-5’s 64.85%.

These results show that DocReward accurately recognizes differences in document structure and style that existing LLMs often overlook.

Experiment 2: Improving document generation with DocReward

To assess DocReward’s impact on real document-generation tasks, the team ran experiments in which AI agents produced multiple candidate documents from the same text. Different reward models were then used to select the best-structured and best-styled version as the final output.

Three reward strategies were compared: random selection, GPT-5 as the reward model, and DocReward as the reward model. Human evaluators assessed each final document for structure and style, recording win/lose/tie ratios.

As shown in Figure 5, random selection performed the worst (24.6% win rate); GPT-5 improved performance to 37.7%; and DocReward achieved a 60.8% win rate and only a 16.9% loss rate, significantly outperforming both baselines.

Figure 5. Comparison of reward model strategies for document generation

To visually demonstrate DocReward’s ability to assess structure and style, the team conducted sample analyses using documents with identical text but differing layouts, as shown in Figure 6.

Figure 6. DocReward captures differences in document structure and style quality
  • Sample (a): The document has poor whitespace allocation: last-name field spacing is too small, and first-name field spacing is too large. This results in an unbalanced layout. Additionally, the key fields (Faculty/Department, Country, Country Code) are misaligned, creating visual clutter. The score is 1.21.
  • Sample (b): The table-like layout of the document is more organized than that of Sample (a), but the heading font is too small, lacking a clear distinction from the body text and weakening the visual hierarchy. Additionally, the input fields lack borders, making information harder to interpret. The score is 2.11.
  • Sample (c): The document features a clear, standardized structure, with a larger heading font, balanced whitespace, a well-aligned layout, and strong readability. The score is 5.34.

These examples show that DocReward accurately distinguishes differences in structural and stylistic quality, with scores consistent with human evaluations. Together, the experiments and sample analyses confirm that DocReward reliably guides agents to produce documents that align with human expectations for accuracy and presentation quality, supporting the agentic transformation of core office software like Microsoft Office.

The post DocReward: Advancing professional document design through AI evaluation appeared first on Microsoft Research.

]]>
OPA-DPO: Efficiently minimizing hallucinations in large vision-language models http://approjects.co.za/?big=en-us/research/articles/opa-dpo-efficiently-minimizing-hallucinations-in-large-vision-language-models/ Mon, 27 Oct 2025 03:54:46 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1153391 Large vision-language models are improving at describing images, yet hallucinations still erode trust by introducing contradictions and fabricated details that limit practical applications. In response, Microsoft Research Asia has developed On-Policy Alignment DPO (OPA-DPO), a new algorithm that aligns expert feedback with the model’s own output distribution before training begins. This “on-policy” alignment slightly alters […]

The post OPA-DPO: Efficiently minimizing hallucinations in large vision-language models appeared first on Microsoft Research.

]]>
Large vision-language models are improving at describing images, yet hallucinations still erode trust by introducing contradictions and fabricated details that limit practical applications.

In response, Microsoft Research Asia has developed On-Policy Alignment DPO (OPA-DPO), a new algorithm that aligns expert feedback with the model’s own output distribution before training begins. This “on-policy” alignment slightly alters the model so that expert corrections are close to what the model would naturally produce. As a result, the model is more likely to learn from these expert demonstrations, rather than treating them as outliers to be ignored.

Until now, most attempts to curb hallucinations have involved retraining models with extra data or applying filters to clean up their answers afterwards. While these approaches can help, they’re computationally expensive and don’t address the root issue: how models learn to distinguish accurate from misleading responses.

Direct Preference Optimization (DPO) has recently emerged as a solution. It trains models to favor accurate responses by learning from pairs of good and bad examples. However, when DPO is applied to vision-language tasks, it’s often inadequate because the expert-corrected examples differ too much from what the model would naturally generate, preventing effective learning.

OPA-DPO addresses this by providing a simpler and more data-efficient way to reduce hallucinations while using less training data than previous methods. This work has been recognized with an oral presentation at CVPR 2025.

Limitations of current DPO methods

Previous approaches fall into three categories:

  1. Hallucination injection, which injects hallucinated fragments into standard responses. Preference pairs are then constructed by pairing standard responses with their corresponding hallucinated versions.
  2. Hallucination recognition, where models generate responses and humans or GPT-4/4v identify and correct hallucinations. Preference pairs are then constructed by pairing corrected responses with their original versions.
  3. Self-evolution, where models generate multiple responses and a hallucination-recognition model ranks them by severity. Preference pairs are constructed based on these ranking results.
Figure 1. Three categories of previous approaches

Among these, self-evolution tends to perform best, followed by recognition and then injection. However, all three approaches face limitations. Hallucination injection is weak because the fabricated content does not reflect the model’s own tendencies. Self-evolution is more effective but computationally costly. Recognition, while seemingly the most intuitive, underperforms in practice because expert-edited responses are often too different from the model’s natural outputs. Standard DPO struggles to learn from this “off-policy” data, leading to vanishing gradients and little improvement.

These challenges highlight the need for a method that can incorporate expert corrections while staying aligned with the model’s own output distribution.

OPA-DPO: Breaking convention, reshaping alignment strategy

To address these challenges, OPA-DPO introduces an on-policy alignment step before DPO training. With only 4.8k training samples, it achieves state-of-the-art performance, compared with the 16k samples required by previous leading methods.

Figure 2. OPA-DPO implementation method  

OPA-DPO aligns a model’s outputs with expert-preferred responses through a four-step process. First, it generates responses from the model using both the image and prompt. Next, expert feedback—such as that from GPT-4v—is used to finely edit these responses, correcting hallucinations while preserving accurate content.

The edited and ground-truth responses are then used to fine-tune the data-producing model via LoRA-SFT, resulting in what is referred to as the OPA model. Finally, DPO training is performed on the OPA model, incorporating language, image, and anchor preference pairs. Among these stages, the OPA step has the greatest impact on performance. This process is shown in Figure 3.

Figure 3. OPA-DPO achieves alignment in four steps
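
The four steps can be summarized as a pipeline sketch. The function below only traces the order of operations with string stubs; none of the names correspond to real training code:

```python
def opa_dpo_pipeline(model, images, prompts):
    """Trace OPA-DPO's four stages (stub implementation for illustration)."""
    trace = []

    # Step 1: sample responses from the current model for each image/prompt pair.
    responses = [f"response({p})" for p in prompts]
    trace.append("generate")

    # Step 2: expert feedback (e.g., GPT-4v) finely edits the responses,
    # correcting hallucinations while preserving accurate content.
    edited = [r + "+edited" for r in responses]
    trace.append("edit")

    # Step 3: LoRA-SFT on the edited and ground-truth responses yields
    # the "OPA model", pulling expert corrections on-policy.
    opa_model = model + "+lora_sft"
    trace.append("opa")

    # Step 4: DPO training on the OPA model with language, image,
    # and anchor preference pairs.
    final_model = opa_model + "+dpo"
    trace.append("dpo")
    return final_model, trace

final_model, trace = opa_dpo_pipeline("llava-1.5-7b", ["image"], ["describe the scene"])
print(trace)  # ['generate', 'edit', 'opa', 'dpo']
```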

Researchers compared various DPO-based algorithms fine-tuned on LLaVA-1.5-7B and 13B. With only 4.8k training samples, OPA-DPO achieves state-of-the-art performance on 50% of hallucination metrics for LLaVA-Instruct-1.5-7B. This improves to 70% for LLaVA-Instruct-1.5-13B. OPA-DPO demonstrates particularly strong results on metrics that directly measure hallucination occurrence, such as CHAIR and HalRate. The results are shown in Table 1.

Table 1. To fairly compare various RLAIF/RLHF-enhanced LVLM algorithms, researchers used a greedy-search algorithm to evaluate across multiple benchmarks, annotated sources to distinguish official reproductions from paper results, and bolded the best scores in each metric group.

Evaluating OPA-DPO

To validate the importance of OPA and data volume, researchers conducted ablation studies. Even with 600 training samples, OPA-DPO performs better than most baseline algorithms on hallucination-related metrics. As the data volume increases, the performance of OPA-DPO steadily improves. Incorporating the OPA operation leads to a nearly 50% reduction in AMBER HalRate and Object-halCHAIRs.

Figure 4. Impact of training data volume and OPA operation on OPA-DPO (ablation study)

They also experimented with LLaVA-OneVision as the base model. Despite its detailed but redundant outputs and numerous hallucinations, OPA-DPO significantly improved hallucination metrics with 2.4k training samples, achieving a 43.2% reduction in HalRate and a 38.7% improvement in CHAIR scores compared to baseline models.

Table 2. Experimental results of OPA-DPO on LLaVA-OneVision

OPA-DPO-trained models tend to adopt a conservative strategy, emphasizing salient and verifiable observations while minimizing attention to ambiguous or less relevant details. As illustrated in Figure 5, this approach focuses the description on the actions of the three individuals at the center of the image, while deliberately ignoring peripheral elements such as trees and minor details like backpacks that are speculated by base models. By avoiding speculative or overly detailed content that could introduce hallucinations, the models prioritize clarity and reliability—contributing to their improved performance on hallucination metrics.

Figure 5. Impact of OPA operation on model output in image description tasks

Interestingly, base models often assume the query language is accurate, even when it contains hallucinations, leading to responses that reinforce false premises. In contrast, OPA-DPO-trained models demonstrate the ability to detect and reject hallucinated content embedded in the query itself. As shown in Figure 6, this approach can identify fabricated elements—such as the mention of “hands” in the input prompt—and respond with clarifications or corrections rather than perpetuating the hallucination.

Figure 6. In erroneous premise inquiry tasks, models trained with OPA-DPO show the ability to identify hallucinations in the query.

OPA-DPO not only improves algorithm performance but also advances multimodal alignment methods. Its approach of generating on-policy data from expert feedback marks a step forward in multimodal alignment training.

The post OPA-DPO: Efficiently minimizing hallucinations in large vision-language models appeared first on Microsoft Research.

]]>
Microsoft study shows AI assistants help with development for programmers who are blind or have low vision http://approjects.co.za/?big=en-us/research/articles/microsoft-study-shows-ai-assistants-help-with-development-for-programmers-who-are-blind-or-have-low-vision/ Tue, 30 Sep 2025 01:30:29 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1150836 Developers who are blind or have low vision have historically been limited to back-end programming, but new research suggests AI programming assistants are changing that in remarkable ways. A Microsoft Research Asia study found that developers who use screen readers can now tackle previously challenging tasks like UI development through an AI-assisted software development technique […]

The post Microsoft study shows AI assistants help with development for programmers who are blind or have low vision appeared first on Microsoft Research.

]]>
Developers who are blind or have low vision have historically been limited to back-end programming, but new research suggests AI programming assistants are changing that in remarkable ways. A Microsoft Research Asia study found that developers who use screen readers can now tackle previously challenging tasks like UI development through an AI-assisted software development technique where natural language replaces traditional syntax, also known as vibe coding.

The implications extend far beyond accommodation. Only 1.7% (roughly 1,100 of 70,000) of surveyed developers (opens in new tab) are blind or have low vision. Yet the Microsoft Research study shows that AI assistants can unlock new capabilities for this group, sometimes surpassing traditional methods.

“I used to do only non-UI development because my visual impairment made UI tasks difficult,” said one blind developer. “Now, I turn user feedback into prompts for GitHub Copilot to modify code, ask it to check the generated code, and send screenshots for review. I can even review the code myself. This has greatly simplified my workflow.”

The research

The Microsoft Research Asia team recruited 16 developers with varying experience levels and degrees of visual impairment for a comprehensive three-phase study examining real-world use of GitHub Copilot in Visual Studio Code.

In the first phase, participants completed onboarding and coding tasks. Then they used Copilot in their daily work for two weeks while documenting their experience. Final interviews captured long-term feedback on participants’ performance and sentiment.

GitHub Copilot proved ideal for the study because it already incorporates accessibility features: sound cues in addition to visual prompts, text-based views for layout clarity, and multimodal capabilities that convert visual content like screenshots into textual descriptions. The tool’s features are illustrated in Figure 1.

Figure 1. GitHub Copilot feature overview, showing the core functions of code completion, inline chat, and a dedicated chat panel with three modes: Ask, Edit, and Agent.

Beyond basic accommodation

“Through real-world use, participants consistently reported that AI programming tools improved efficiency, enhanced coding skills, and lowered the barrier to learning new technologies,” said Luna Qiu, technical program manager at Microsoft Research Asia – Shanghai. “More importantly, these tools used the multimodal capabilities of large models to assist with visual elements, expanding users’ capabilities.”

The study revealed how participants were adapting the new vibe coding approach to overcome traditional limitations. One developer explained: “I like to discuss plans or ask for explanations in Ask mode before letting GitHub Copilot handle my files.” Another noted the power of natural language: “I used natural language to ask GitHub Copilot to undo an operation—and it worked.”

But the benefits went beyond simple task completion. “Accessibility isn’t just about adding labels or shortcuts,” said another developer who is blind. “More types of cues, like sound effects, help me better understand changes. Too many text prompts can actually interfere with my code comprehension.”

For newcomers to programming, the impact was particularly striking. “With a code assistant like GitHub Copilot, getting started with programming is much easier,” one participant noted. “In daily life, we have all kinds of needs, and better programming capabilities help us meet personalized requirements.”

Video 1 shows how screen readers enable users to review code four times faster than they normally would.

Video 1

Video 2 shows the actual GitHub Copilot interface.

Video 2

Eight critical improvements

The research team identified specific pain points and solutions across four key areas of AI programming tools.

Managing AI interactions

More consistent shortcuts and clearer feedback: Users often run into conflicting keyboard shortcuts that don’t behave consistently across sessions. Because of this, some resort to clumsy workarounds like copying content to the clipboard and pasting it elsewhere for editing. We recommend creating a consistent and predictable shortcut system that minimizes conflicts, reduces extra navigation, and provides timely, accessible session settings.

Guidance on prompts and model choice: AI suggestions are sometimes too brief or based on incorrect assumptions, which requires users to repeatedly refine prompts. As users gain experience, tools should help by detecting vague prompts, asking for clarification, and offering straightforward guidance on selecting suitable AI models for the task.

Reviewing AI responses

Clearer responses: For developers using screen readers, audio cues can be unclear or distracting, and intermixed code changes are difficult to follow. We recommend a system that tracks changes through clear sound cues or text indicators, provides concise text summaries, and groups related information to reduce navigation effort and cognitive load.

Smarter message navigation: Lists of messages can help organize interactions, but navigation is often linear and inefficient. Long responses and input fields that are hard to exit add to the difficulty. We recommend a more navigable format that groups related messages, uses headings or indexes for orientation, minimizes misleading content, and provides reference information to build trust.

Accessible view, optimized: A plain-text accessibility view simplifies navigation but often loses important detail, especially in formatted content like tables. A simplified UI is valuable, but it should still preserve the completeness and integrity of information.

AI response playback: Automatic playback of AI responses can reduce manual effort, but long passages can interrupt thought flow and be hard to digest. We recommend making this “autoplay” optional so that users can choose their preferred interaction style.

Staying focused across views

Improving focus with integrated views: Switching between the editor, chat panel, and terminal can break concentration and increase the risk of errors. In Agent mode, developers must divide attention across multiple views, which makes this even harder. We recommend consolidating key information and actions into a single panel, along with self-verification tools and clear feedback to reduce the need for manual cross-checking.

System status and next steps

Clear status updates: After submitting a request, users need timely updates to understand system status. In Agent mode, vague notifications make it harder to decide on next steps. We recommend providing clear status updates that separate AI-driven actions from those requiring user input, and adding a “Do Not Disturb” setting to minimize unnecessary interruptions.

“AI programming tools are expanding in functionality, but for users of screen readers, more features don’t mean better usability,” said Nan Chen, research SDE at Microsoft Research Asia – Shanghai. “Complex interfaces, convoluted workflows, and unpredictable feedback reduce efficiency. What’s needed is to deliver more value through fewer actions. Striking the right balance between added features and streamlined usability will be a key challenge for future accessibility design.”

Looking ahead to personalized AI programming

As tools evolve from passive adaptation to active customization, personalization is emerging as a new direction for accessible programming. Users of screen readers have diverse preferences: some want minimal text for quick access to information, while others need richer detail to understand code logic and structure.

“With the learning and adaptation capabilities of large models, AI programming tools can tailor interactions to each user’s traits and habits, becoming a truly personalized assistant,” said Luna Qiu.

These new interaction models and workflows expand the potential of human-AI collaboration and highlight opportunities to improve accessibility. Based on these insights, the research team proposed specific recommendations for more accessible programming.

For example, accessibility design should be built in from the start, not added as a post-launch patch. When screen reader use cases are considered early in the process, accessibility is embedded throughout the product.

Regarding developer support, the focus should go beyond documentation that relies heavily on visuals like screenshots or diagrams. Creating learning materials designed specifically for users of screen readers can lower barriers, improve efficiency, and help more people master AI programming tools, allowing them to participate more fully in the shift toward AI-assisted development.

The post Microsoft study shows AI assistants help with development for programmers who are blind or have low vision appeared first on Microsoft Research.

StreamMind: AI system that responds to video in real time http://approjects.co.za/?big=en-us/research/articles/streammind-ai-system-that-responds-to-video-in-real-time/ Fri, 15 Aug 2025 03:00:46 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1147987 Imagine a pair of smart glasses that detects its surroundings and speaks up at critical moments, such as when a car is approaching. That kind of split-second assistance could be transformative for people with low vision, but today’s visual AI assistants often miss those moments. The problem isn’t that the technology can’t detect its environment. […]

The post StreamMind: AI system that responds to video in real time appeared first on Microsoft Research.

Imagine a pair of smart glasses that detects its surroundings and speaks up at critical moments, such as when a car is approaching. That kind of split-second assistance could be transformative for people with low vision, but today’s visual AI assistants often miss those moments.

The problem isn’t that the technology can’t detect its environment. It’s that current AI systems get bogged down trying to analyze every single frame of video, dozens per second, slowing themselves down in the process. By the time they recognize what’s happening, the moment for helpful intervention has passed.

Now, researchers from Microsoft Research Asia and Nanjing University have designed a system aimed at overcoming this limitation. Their model, called StreamMind, processes video more like a human brain, skimming over uneventful moments and focusing only when something important occurs. The result is video processing that’s up to ten times faster, quick enough to respond as events unfold.

A brain-inspired approach

The key insight is surprisingly simple: instead of analyzing every frame, StreamMind uses an event-gated network that separates fast perception from deeper analysis (Figure 1).

A lightweight system continuously scans video for changes. Only when something meaningful occurs, like a car entering a crosswalk, does it trigger a more powerful large language model (LLM). This decoupling lets the perception module run at video speed, while the cognition module, the LLM, activates only when needed. By removing unneeded computation, StreamMind can keep pace with the video stream, maintaining real-time awareness of its environment.

Figure 1. Traditional streaming video framework (left) versus StreamMind’s event-gated, decoupled perception and cognition modules (right).
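As a rough sketch (a toy illustration, not the actual StreamMind implementation), the decoupled loop looks something like this: a cheap per-frame update maintains a running summary of the stream, and the expensive cognition step fires only when the gate detects a large deviation from that summary.

```python
# Toy event-gated loop in the spirit of StreamMind (illustrative names and
# logic; the real system uses learned perception, gating, and an LLM).

def perceive(frame, prev_token, alpha=0.9):
    """Cheap running summary of the stream (stand-in for the perception module)."""
    return alpha * prev_token + (1 - alpha) * frame

def gate(token, frame, threshold=5.0):
    """Fire only when the new frame deviates strongly from the summary."""
    return abs(frame - token) > threshold

def cognition(frame):
    """Stand-in for the heavy LLM call; invoked only on gated events."""
    return f"event at value {frame}"

def stream_loop(frames):
    token, responses = 0.0, []
    for frame in frames:
        if gate(token, frame):          # rare: trigger the heavy model
            responses.append(cognition(frame))
        token = perceive(frame, token)  # always: cheap per-frame update
    return responses

# Mostly-uneventful stream with one sudden change:
print(stream_loop([0.1, 0.2, 0.1, 9.0, 0.2, 0.1]))  # → ['event at value 9.0']
```

In the toy run above, the heavy `cognition` call fires once, on the spike at 9.0, while the other five frames cost only a multiply-add each; the real system replaces the scalar summary with EPFE's learned perception token and the threshold test with a learned gate.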

Demonstrations: StreamMind in action

In demonstrations, StreamMind provided responses that matched the timing of events, while current methods lagged. It kept pace with a soccer match, providing smooth play‑by‑play commentary, and guided a cook through a recipe step by step.

Video 1. Navigation assistance: When compared with current methods, StreamMind responds as events occur, while other methods react noticeably later.
Video 2. Sports commentary: In a live soccer match, it keeps up with the flow of play and delivers timely narration.
Video 3. Cooking guidance: In a kitchen setting, the model provides instructions step-by-step, keeping pace with the action.

How the technology works

StreamMind combines two key innovations to enable real-time video perception and response:

Smart memory system

The Event Perception Feature Extractor (EPFE) addresses the biggest bottleneck in current video AI models: how to handle incoming frames in real time without getting overwhelmed. It uses a state‑space model—a method for tracking how data streams (such as video, audio, or sensor inputs) change over time—to extract patterns from long, continuous input. This allows the EPFE to remember key events using just one compact piece of information, called a perception token, and enables the system to efficiently keep pace with the video stream.
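In spirit, a state-space model maintains a hidden state that compresses the history of the stream, updating it with each new input and emitting a compact output. The scalar toy below (illustrative only; the real EPFE is learned and high-dimensional) shows the recurrence, with the output standing in for the perception token:

```python
# Toy scalar state-space recurrence: h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t.
# The state h carries a compressed history of the stream; y_t plays the role
# of the one-token summary passed downstream.

def ssm_step(h, x, a=0.8, b=0.2, c=1.0):
    """One recurrence step; returns (new_state, output)."""
    h = a * h + b * x
    return h, c * h

def summarize(stream):
    h, tokens = 0.0, []
    for x in stream:
        h, y = ssm_step(h, x)
        tokens.append(round(y, 4))
    return tokens

print(summarize([1.0, 1.0, 1.0, 0.0]))  # → [0.2, 0.36, 0.488, 0.3904]
```

Because each step touches only the fixed-size state rather than all past frames, the cost per frame is constant, which is what lets the perception module keep pace with the video stream.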

Intelligent decision making

The second component determines whether what’s occurring in the video is relevant to the user’s request and whether the assistant should respond. This is a challenge because often there’s no direct connection between a user’s request and individual video frames. For example, a request like “help me fix my bike” requires understanding when to jump in with assistance.

To make those judgments, StreamMind draws on knowledge from an LLM to recognize when events are relevant and a response is needed. A small gating network, combined with a compact one-token summary of the video input, allows StreamMind to monitor events in real time and autonomously call on the LLM when it is time to act.

Figure 2: StreamMind architecture. EPFE (blue) continuously extracts video features. The gating network (labeled “Cognition Gate” in red) decides whether to invoke the large model.

Testing shows major speed gains

When evaluated against existing methods, StreamMind’s processing speed surpassed all other systems at every tested video speed. Even for fast 100-fps gaming video streams, it kept up with every frame in real time, something no previous system could manage (Figure 3).

Figure 3. Frames per second (FPS): This chart shows the time it took for StreamMind as well as two popular video models to process one second of streaming video at different speeds (A100 GPU). StreamMind (the third bar in orange) achieves 100-fps processing speed.

The researchers tested StreamMind in a range of scenarios, including online video commentary, predicting what would happen next in a video, and recognizing complex tasks like changing a tire or cooking. They used large datasets such as Ego4D (3,670 hours of first-person video from 923 participants across 74 locations), SoccerNet (videos of 12 European soccer matches), and COIN (11,827 instructional videos across 12 different subjects). The following tables show the detailed results of these tests.

Table 1. Results from the Ego4D and SoccerNet experiments
Table 2. Ego4D LTA dataset experiments
Table 3. COIN dataset experiments

Across all tests comparing StreamMind’s timing alignment and language modeling capabilities to those of existing streaming dialogue models, StreamMind delivered the best results, demonstrating that it can handle complex, fast-changing, real-world scenarios.

From lab to real life

StreamMind’s event-driven design could make wearable AI systems more responsive, allowing smart glasses and similar devices to react to important events as they happen rather than after the fact. By focusing on the moments that matter, rather than every frame, it could make smart glasses and similar devices far more responsive—able to guide, warn, and assist in step with real-world events.

TimeCraft: A universal framework for time-series generation http://approjects.co.za/?big=en-us/research/articles/timecraft-a-universal-framework-for-time-series-generation/ Mon, 04 Aug 2025 09:31:57 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1146392 Time-series data—measurements collected at regular intervals, like stock prices or traffic flows—has become a key driver of intelligent decision-making systems across industries. From medical monitoring to financial risk control, identifying patterns in this data is essential to many important operations. At the same time, the creation of time-series data, or data synthesis, is gaining momentum […]

The post TimeCraft: A universal framework for time-series generation appeared first on Microsoft Research.

Time-series data—measurements collected at regular intervals, like stock prices or traffic flows—has become a key driver of intelligent decision-making systems across industries. From medical monitoring to financial risk control, identifying patterns in this data is essential to many important operations.

At the same time, the creation of time-series data, or data synthesis, is gaining momentum as organizations grapple with scarcity of real-world data, privacy protection, and the need to test a variety of different scenarios without exposing themselves to risk. AI-generated synthetic data simulates realistic patterns in a risk-free environment. It enables researchers to explore hypothetical scenarios and train models to make decisions in high-stakes contexts.

Yet many of these models fall short of what’s needed. To be truly practical, a generator of time-series data must adapt across different industries and data patterns, offer precise control over trends and volatility and produce data that is realistic and reliable enough to support accurate modeling and analysis.

Microsoft Research Asia developed TimeCraft (opens in new tab) to address this need. This open-source framework creates synthetic time-series data that can be used across different industries and scaled up for commercial applications. Users control data generation through simple written commands, and the system can adapt to different business needs, whether companies want to analyze existing patterns or create data for specific goals.

Three ways to guide generation

TimeCraft’s user interface is built for flexibility. Users can guide data generation through three distinct methods:

  • Few-shot adaptation: Users can upload a small set of unlabeled samples from the target domain. TimeCraft learns structural features from these samples and generates high-quality data, no retraining or labels required.
  • Natural language control: Users can describe their desired time series in plain language, such as “stable early on, followed by sharp fluctuations.” TimeCraft interprets the prompt and produces data accordingly.
  • Task model feedback: Users can integrate their models—like a disease predictor or market trend detector—into the data creation process. TimeCraft dynamically adjusts the output based on model feedback, optimizing the data for performance.

These methods can be used independently or together, allowing users to generate data that aligns with specific goals, scenarios, or operational needs.

Figure 1. Overview of the TimeCraft architecture

One model, many industries

TimeCraft works across multiple industries—where each type of time-series data follows distinct patterns—with a unified approach built around semantic prototypes. These are shared representations of time-series structures that serve as a universal vocabulary.

When users provide a few example time-series sequences from their specific industry, the Prototype Assignment Module (PAM) maps them to the prototype space, calculating optimal combinations. This industry-specific input guides the model to generate structurally aligned data, no labels or retraining needed.
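A minimal sketch of this kind of soft assignment (illustrative only; the actual PAM operates on learned embeddings and learned prototypes) weights each prototype by its similarity to the input series:

```python
# Toy prototype assignment: softmax weights over a prototype bank, scored by
# negative squared L2 distance. An input series is then represented as a soft
# combination of the shared prototypes.
import math

def soft_assign(series, prototypes):
    """Return softmax weights over prototypes by negative squared distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(series, p)) for p in prototypes]
    exps = [math.exp(-d) for d in dists]
    z = sum(exps)
    return [e / z for e in exps]

prototypes = [
    [0.0, 1.0, 2.0, 3.0],  # a "rising" prototype
    [3.0, 2.0, 1.0, 0.0],  # a "falling" prototype
]
weights = soft_assign([0.1, 1.1, 2.0, 2.9], prototypes)
print(weights[0] > 0.99)  # the rising prototype dominates → True
```

Because the prototype bank is shared across domains, a handful of unlabeled examples is enough to locate a new domain in this space, which is what makes label-free, retraining-free adaptation possible.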

The result is a system that can rapidly adapt to new scenarios in fields such as energy, healthcare, finance, and transportation, demonstrating strong structural transfer and generalization.

Text-controlled generation: One sentence guides the model

In many real-world scenarios, users know what kind of data they need but don’t have access to enough relevant examples. A typical request might be: “I want a time series that slowly rises for a few days, drops around day 10, and then fluctuates.” These types of needs often arise in fields like healthcare and finance, where designing and testing systems with realistic data is essential but data access is limited.

TimeCraft makes it possible to generate this kind of tailored data using plain language. Instead of relying on specialized tools or existing datasets, users can simply describe the pattern they’re looking for, and the system creates data that fits.

It does this using a collaborative training process involving multiple AI agents. It collects phrasing from real-world industry reports, fills in details using actual data statistics, and refines the wording until the descriptions match the data both clearly and accurately.

When a user submits a description, TimeCraft translates it into guidance for its generative model, enabling direct input, even from users without technical expertise. This makes the tool especially useful in situations where data is scarce or constantly changing. By bridging the user’s intent with the model’s capabilities, TimeCraft makes custom data generation as simple as writing a sentence. This process is illustrated in Figure 2.

Figure 2. TimeCraft text-to-time series module consists of: (top) a multi-agent system that creates pairs of plain-language descriptions and matching time-series data, and (bottom) a hybrid mechanism that turns user-written descriptions into synthetic time-series data.
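To make the idea concrete, here is a rule-based toy (not TimeCraft's learned pipeline) that maps shape keywords from a comma-separated prompt to segments of a synthetic series:

```python
# Toy text-to-series generator: each comma-separated clause of the prompt
# drives one segment of the output. Purely illustrative — TimeCraft uses a
# learned text encoder conditioning a generative model, not keyword rules.

def generate_from_prompt(prompt, length=12):
    series, level = [], 0.0
    segments = [s.strip() for s in prompt.lower().split(",")]
    per_seg = length // len(segments)
    for seg in segments:
        for _ in range(per_seg):
            if "rise" in seg:
                level += 1.0                                   # upward trend
            elif "drop" in seg:
                level -= 2.0                                   # sharp decline
            elif "fluctuat" in seg:
                level += 0.5 if len(series) % 2 == 0 else -0.5  # oscillation
            series.append(level)
    return series

s = generate_from_prompt("slowly rises, drops around the middle, then fluctuates")
print(s)  # rises to 4.0, falls to -4.0, then oscillates
```

The contract is the same as in the real system: a plain-language description in, a series with the described shape out — TimeCraft just learns the mapping instead of hard-coding it.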

Task-aware generation: Optimized for real-world impact

Most generation models focus on producing realistic data. TimeCraft goes a step further, generating data that improves performance of downstream applications—whether it’s detecting disease trends or modeling market behavior.

This is possible thanks to TimeCraft’s task-aware generation framework. Users can integrate their existing models directly into the data-creation process. The system then uses feedback from these models to guide the direction of data generation in real time, so the output isn’t just realistic, it’s useful.

At the core of this method is a technique called influence scoring, which estimates how each piece of generated data affects a model’s performance. TimeCraft uses these scores to guide the generation process, helping the system produce data with the greatest potential to improve results. This process is shown in Figure 3.

Figure 3. Influence scoring process within the TimeCraft framework
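The idea behind influence scoring can be illustrated with a deliberately simple stand-in (TimeCraft uses an efficient estimator rather than retraining; the mean predictor below is purely illustrative): score each candidate by the drop in validation error when it is added to training, then keep the top scorers.

```python
# Toy influence scoring: the influence of a candidate sample is the reduction
# in validation error after adding it to the training set of a trivial
# mean-predictor model.

def val_error(train, val):
    """MSE of a mean predictor fit on `train`, evaluated on `val`."""
    mean = sum(train) / len(train)
    return sum((v - mean) ** 2 for v in val) / len(val)

def influence(candidate, train, val):
    """How much does adding `candidate` reduce validation error?"""
    return val_error(train, val) - val_error(train + [candidate], val)

train, val = [0.0, 0.2], [1.0, 1.2]
candidates = [0.1, 1.1, 2.5]
scores = [influence(c, train, val) for c in candidates]
best = candidates[scores.index(max(scores))]
print(best)  # → 2.5, the sample that helps the validation set most
```

Note that the highest-influence sample is not the one closest to the existing training data: it is the one that pulls the model toward the patterns the downstream task actually needs, which is exactly the rare-but-important case described above.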

This approach is especially helpful in cases where certain patterns are rare or critically important. For instance, in medical diagnosis, TimeCraft can focus on generating a small set of patterns that meaningfully improve prediction accuracy.

By shifting the goal from simulating data to generating data that actively improves outcomes, TimeCraft turns synthetic data into a strategic tool.

Built for real-world use, now open source

TimeCraft was built for real-world applications. It accepts different types of input, adapts to complex use cases, and improves over time using feedback from the tasks it supports. Researchers at Microsoft Research Asia envision it as a comprehensive solution for industries where data is limited, expensive to collect, or sensitive to share—making data generation more targeted, useful, and scalable.

Now open source (opens in new tab), TimeCraft is available for developers, researchers, and business partners around the world to explore, test, and build on.

Related research:

  • Cross-domain generalization
  • Controllability
  • Task adaptability
  • General techniques
  • Financial applications
