Microsoft Research Lab - Asia Articles http://approjects.co.za/?big=en-us/research/ Tue, 10 Feb 2026 04:19:06 +0000

Phi-Ground: Improving how AI agents navigate screen interfaces http://approjects.co.za/?big=en-us/research/articles/phi-ground-improving-how-ai-agents-navigate-screen-interface/ Mon, 19 Jan 2026 07:28:22 +0000

Imagine an AI assistant that can navigate a computer the same way humans do—clicking buttons, filling out forms, and moving between applications—all by simply interpreting what’s on the screen. This vision is becoming a reality through computer use agents—AI systems designed to operate software interfaces autonomously. Yet for these agents to function, they need to know exactly where to click and what to interact with on a screen.

This capability, called GUI grounding, enables computer use agents to locate specific elements on a graphical user interface (GUI). GUI grounding is the agent’s perception system that translates instructions like “click the Submit button” into exact screen coordinates.

Currently, grounding models succeed only 65% of the time, far from reliable enough for everyday use. A research team from Microsoft Research Asia conducted an extensive study to understand why, examining every stage of how these models are built and trained. Their work produced Phi-Ground (opens in new tab), a new model family that achieves state-of-the-art performance across all five grounding benchmarks among comparably-sized models.

The significance of GUI grounding

Computer use agents could transform how we interact with software in the digital world. These agents operate through graphical interfaces exactly as humans do, without requiring specialized APIs and allowing straightforward human oversight. This universality gives computer use agents broader potential than traditional robotic systems or specialized web automation tools.

GUI grounding serves as the agent’s interface with the digital world, the mechanism that determines whether the system succeeds or fails at its tasks. An agent that can’t reliably find what it’s looking for on screen is fundamentally limited, regardless of how well it can reason or plan.

Building better training data

Training more capable grounding models requires large-scale, high-quality data. The research team started with web pages from CommonCrawl, a massive public repository of internet content, and rendered them as screenshots to generate training examples. Yet web data contains substantial noise that can derail model training, from broken layouts and malformed pages to irrelevant content.

To address this, the team developed a multi-stage workflow to filter and refine the data, illustrated in Figure 1.

Figure 1. Workflow for refining the data obtained from CommonCrawl

Beyond CommonCrawl, the team incorporated open-source datasets, screenshots from web searches, and manually annotated examples for everyday scenarios, such as grounding in office software, where precision matters most. Together, these sources formed the training foundation for the Phi-Ground model family. Table 1 shows the final composition of the training data, including how many times each dataset was used during training (epoch) and its relative importance in the learning process (weight).

Table 1. The data used to train Phi-Ground

How to train grounding models

The team discovered that the order in which text and images are fed into the model significantly impacts performance. They tested two approaches: inputting text before images, and the reverse. The results in Table 2 show that text-first yields substantially better outcomes.

Table 2. Comparison of input order for text and images.

Why does this matter? Transformer models, the architecture underlying most modern AI systems, use causal processing, meaning earlier inputs cannot be updated using information from later ones. When images come first, the model processes visual information without receiving the user’s instructions (like “click the Submit button”). When text comes first, the model interprets visual information after receiving the instructions. It knows what to search for as it processes the image. For perception tasks like grounding, this instruction-aware process directly influences results. A simple change in input order produces this effect.
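This ordering effect can be illustrated with a toy causal attention mask (a sketch for intuition only, not Phi-Ground's actual implementation): each position may attend only to earlier positions, so image tokens placed after the instruction can condition on it, while image tokens placed before it cannot.

```python
import numpy as np

def causal_mask(n):
    # Position i may attend to positions <= i (lower-triangular mask).
    return np.tril(np.ones((n, n), dtype=bool))

# Toy sequence: 3 instruction tokens followed by 4 image tokens (text-first).
n_text, n_image = 3, 4
mask = causal_mask(n_text + n_image)

# Text-first: every image token can attend to all instruction tokens.
image_positions = range(n_text, n_text + n_image)
print(all(mask[i, :n_text].all() for i in image_positions))  # True

# Image-first: image tokens come before the text, so none of them
# can attend to any instruction token.
mask2 = causal_mask(n_image + n_text)
text_positions = range(n_image, n_image + n_text)
print(any(mask2[i, j] for i in range(n_image) for j in text_positions))  # False
```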

Researchers also examined computational costs during testing. These costs depend not only on model size but also on the number of image tokens—the units into which images are divided for processing. For perception tasks, more image tokens generally improve performance, but at what point do the benefits plateau?

To answer this, the team investigated the relationship among model size, image token count, and training data volume. Such experiments can guide developers balancing training efficiency with application speed, which are practical concerns beyond raw parameter counts.

Figure 2 illustrates training results for six models with different sizes and image detail levels. Across these evaluations, inference time generally tracks total computation, shown on the x-axis. Notably, many current studies report only parameter counts when discussing model performance, overlooking computational factors like image token count, a gap this research addresses directly.

In experiments where the model architecture was fixed, the team found that for demanding benchmarks like ScreenSpot-Pro and UI-Vision, image token count significantly impacts performance. When the count falls below a certain threshold, it becomes a bottleneck: the model cannot perceive small interface elements, reducing accuracy. Beyond approximately 2,000 image tokens, however, the benefits plateau, and additional tokens yield little further gain.
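As a rough back-of-the-envelope sketch, assuming a patch-based vision encoder with 28-pixel patches (an illustrative figure, not a Phi-Ground detail), a 1080p screenshot already crosses the roughly 2,000-token mark, while 720p falls well below it:

```python
# Rough sketch: image token count for a patch-based vision encoder.
# The 28 px patch size is an illustrative assumption, not a Phi-Ground detail.
def image_tokens(width, height, patch=28):
    # Tokens = number of non-overlapping patches covering the image.
    return (width // patch) * (height // patch)

for w, h in [(1280, 720), (1920, 1080), (2560, 1440), (3840, 2160)]:
    print(f"{w}x{h}: {image_tokens(w, h)} tokens")
```

Under this assumption, small UI elements in high-resolution screenshots simply vanish below the patch grid unless enough tokens are spent on the image.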

Figure 2. Relationship between training computation and accuracy

Evaluation results

Table 3 presents test results comparing several open-source models with fewer than 10 billion parameters across five GUI grounding benchmarks. The upper section shows results using the benchmark’s standard reference expressions, typically brief instructions or short phrases. The lower section shows results when researchers used OpenAI’s o4-mini model to generate longer, more detailed reference expressions, which were then tested by the grounding models.

Table 3. Comparison of results across five GUI grounding benchmarks

The Phi-Ground model is trained specifically for agent applications, with training data consisting primarily of various reference expressions. As a result, the model achieves state-of-the-art results across all benchmarks in the agent setting. On ScreenSpot-Pro, it reached 55% accuracy. On UI-Vision, it attained 36.2%, the highest score reported for this benchmark. The model’s performance on the Showdown benchmark surpassed that of commercial systems like OpenAI Operator and Claude Computer Use.

Enabling Copilot to understand onscreen content

The core technology of the Phi-Ground model has been integrated into the Vision Highlighting feature of Windows Copilot. As shown in Figure 3, Copilot can guide users step-by-step through visual tasks, such as helping them construct a bubble graphic.

Figure 3. A demo of Windows Copilot, integrated with the Phi-Ground model, helping users create dialogue bubbles in PowerPoint.

Beyond advancing GUI grounding, this research demonstrates how systematic study of training methods can unlock performance gains across multimodal AI systems. The integration into Windows Copilot marks an early step toward computer use agents that can genuinely assist with everyday digital tasks.

The post Phi-Ground: Improving how AI agents navigate screen interfaces appeared first on Microsoft Research.

Deep Video Discovery: Using agentic search to analyze long-form video http://approjects.co.za/?big=en-us/research/articles/deep-video-discovery-using-agentic-search-to-analyze-long-form-video/ Fri, 19 Dec 2025 08:01:03 +0000

Extracting useful information from long videos, whether meeting recordings, experimental data, or lecture content, requires painstaking manual review. AI tools offer some help: language-vision models can summarize short clips or answer questions when videos are divided into clear scenes or chapters. But for hours‑long recordings packed with information and lacking obvious structure, current models are limited. They process videos slowly, are unable to connect information across long stretches of content, and often provide limited or unhelpful answers.

To address these limitations, researchers at Microsoft Research Asia developed Deep Video Discovery (DVD), an agentic AI framework for long-video analysis. DVD divides long videos into shorter clips for individual analysis, then uses LLM-based reasoning to plan next steps and select appropriate tools. The agent retrieves needed information and uses it to answer complex questions about the video.

How DVD works

DVD operates through a simple cycle: observe the video content, analyze what it means, and choose the next action. Current video-analysis systems follow rigid, predesigned steps that have difficulty adapting to different tasks. In contrast, DVD adjusts its approach based on information it has gathered so far. To support this flexibility, the system operates in two stages:

Stage 1: Building a searchable video database

The system converts long videos into a structured database, dividing them into five-second clips and extracting information at three levels:

  • Global: Provides a topic-level summary of the video.
  • Clip‑level: Includes subtitles and brief text descriptions of each segment.
  • Frame‑level: Includes individual frames and visual details captured moment by moment.

Stage 2: Retrieving information and generating answers

The system uses three core tools to search the database:

  • Global browse: Provides high‑level context and video summaries.
  • Clip search: Retrieves clips that match a description and returns relevant results with subtitles and timestamps.
  • Frame inspect: Examines a specific moment in the video and extracts fine visual details; it can also answer questions about what appears in that frame.

Figure 1. DVD operates in two stages. First, it converts long videos into a searchable database organized by clips and frames at multiple scales. Then, it answers user queries through autonomous search and tool use.

The LLM serves as the system’s orchestrator, running repeated observe-reason-act cycles based on gathered information. This design gives the agent autonomy, ensures that its answers stay grounded in actual video content, and allows the system to break complex questions into smaller, more manageable sub-questions.
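A minimal version of this observe-reason-act cycle might look like the following, with stub tools standing in for DVD's global browse, clip search, and frame inspect. In the real system an LLM chooses each next action; here the policy is hard-coded for illustration:

```python
# Stub tools over a toy database; a real agent would call an LLM to
# decide which tool to invoke at each step.

def global_browse(db):
    return db["summary"]

def clip_search(db, query):
    return [c for c in db["clips"] if query in c["text"]]

def frame_inspect(db, timestamp):
    return db["frames"].get(timestamp, "no detail")

def answer(db, question, max_steps=5):
    notes = [global_browse(db)]              # observe: start from the summary
    hits = clip_search(db, question)         # act: narrow down to matching clips
    for c in hits[:max_steps]:
        notes.append(frame_inspect(db, c["start"]))  # drill into frame details
    return " | ".join(notes)

db = {
    "summary": "cooking tutorial",
    "clips": [{"start": 10, "text": "crack the egg"}],
    "frames": {10: "egg in a bowl"},
}
print(answer(db, "egg"))  # cooking tutorial | egg in a bowl
```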

DVD achieves state-of-the-art performance across benchmarks

DVD achieved state-of-the-art performance across multiple long‑video benchmarks (Table 1). On the challenging LVBench dataset, DVD reached 74.2% accuracy, outperforming all existing methods, a 13.4‑point gain over the previous best method, MR. Video. When transcript data was available, accuracy rose to 76.0%.

Table 1. DVD outperforms prior work by a large margin on LVBench.

DVD also exceeded previous state-of-the-art performance on three other long‑video benchmarks: LongVideoBench, Video MME Long, and EgoSchema, surpassing human‑level accuracy (approximately 76%) on EgoSchema.

The choice of reasoning model critically affects DVD’s performance (Figure 2). Replacing the reasoning model with OpenAI o4‑mini or GPT‑4o causes sharp performance drops, indicating that limited reasoning capability breaks down the agent’s process. Different models also show distinct patterns in how they use tools, how deeply they analyze videos, and how accurately they respond. For example, GPT‑4o often exhibits “overconfidence,” stopping its analysis prematurely. These observations offer practical guidance for designing future agents and developing foundational LLMs.

Figure 2. Analysis of how different foundation models behave inside the agent. The results show clear differences across models.

Toward more comprehensive video understanding

As video content becomes richer and more complex, enabling AI to interpret and reason about what it captures, not just identify individual elements, is a central challenge in video comprehension. DVD offers one path forward through an agentic approach that is interpretable, plannable, and collaborative.

Looking forward, researchers at Microsoft Research Asia are working to develop agents with stronger contextual awareness, and more advanced reasoning capabilities, advancing toward AI systems that can handle complex videos with greater depth, precision, and automation.

The post Deep Video Discovery: Using agentic search to analyze long-form video appeared first on Microsoft Research.

Where AI meets neuroscience: Yansen Wang’s pursuit of human-centered innovation http://approjects.co.za/?big=en-us/research/articles/where-ai-meets-neuroscience-yansen-wangs-pursuit-of-human-centered-innovation/ Thu, 11 Dec 2025 10:27:25 +0000

“Curiosity drives scientific breakthroughs, and the tools we create often reflect the human motivations behind that curiosity.”

For Yansen Wang, a senior researcher at Microsoft Research Asia, this philosophy has guided his work at the intersection of AI and neuroscience.

Wang’s interest in science began early. While his classmates spent hours searching for information to complete an assignment, he solved it in 10 minutes by writing a few lines of code. From then on, programming was his way of tackling complex challenges, and the satisfaction of creating practical solutions fueled his passion for computer science. As his studies advanced, his goals crystallized: he wanted to create technology that genuinely serves people.

Yansen Wang, senior researcher at Microsoft Research Asia

This people-centered approach has shaped Wang’s career. As an undergraduate at Tsinghua University, he witnessed AI defeat the world’s Go champion—a moment that sparked a realization: AI could help us understand how humans think. This insight led him to pursue research in multimodal AI for his master’s studies before joining Microsoft Research Asia – Shanghai. Today, in the AI/ML Group, Wang works to bridge AI and neuroscience.

A two-way journey between AI and neuroscience

Wang’s research focuses on two complementary directions: advancing neuroscience by decoding how the brain works (AI for Brain), and drawing inspiration from the brain to improve AI architectures and algorithms (Brain for AI).

For the AI for Brain project, Wang and his colleagues use noninvasive electroencephalography (EEG), which records the brain’s electrical signals without surgery, to build brain-computer interfaces as their research platform. “Our understanding of the brain is still very limited,” he explains. “Its multitasking abilities and rapid adaptability are far beyond what AI can achieve, so we’re using AI to analyze EEG signals and uncover the links between perception, intention, and brain activity.”

The team has already made substantial progress. In terms of perception, they have decoded broad visual features, such as colors and simple moving scenes. Using diffusion models, they transformed neural signals into matching visual content and developed EEG2Video, a baseline model that reconstructs video clips from EEG signals. To improve generalization across different contexts, the team has built multiple datasets linking EEG signals to everyday behaviors. See the video below for an example of this work.

EEG2Video demo: The original video input is shown at the top and the reconstructed video at the bottom.

For command-based control, Wang and his team tackled the challenge of decoding letters from brain signals. They introduced an innovative codebook approach: instead of having participants imagine letter shapes, which devices find difficult to recognize, they guided participants to imagine body movements, mathematical calculations, and other semantic content that devices can recognize more easily. AI then mapped these signals to specific letters. With portable devices, this method has achieved 30%–40% accuracy across 36 options (26 letters and 10 digits).

“We’re now working to extend this approach to controlling mobile phones and interacting on the web, exploring new interaction modes beyond letter input,” he says.

Still, the approach of using noninvasive brain-computer interfaces comes with many challenges. EEG signals have low signal-to-noise ratios and are easily disrupted by environmental and physiological interference, such as eye movements or muscle activity, making reliable readings difficult on portable devices. EEG data is scarce and must be collected in a controlled setting. And individual differences mean that systems often don’t generalize well across users and tasks.

“We’re advancing research on EEG foundation models and hope to make them more robust with more data and larger models, much like large language models,” Wang explains.

Learning from the brain to make AI more efficient

For the Brain for AI project, Wang and his colleagues are exploring how brain function can address AI’s energy demands.

“The brain can accomplish tremendous thinking and computation with just the energy supplied from a bowl of rice, while AI requires vast resources and electricity to achieve similar results,” he observes. “Even more remarkable is that the brain efficiently handles many tasks without complex networks, like fine motor control. People only need a few examples to learn new tasks, but current AI models often need massive amounts of data to relearn. There must be design principles at play here that AI can learn from.”

The key differences lie in how neurons are structured and operate. Neurons use a spiking mechanism, firing and transmitting signals only when they reach their activation threshold. This results in extremely low energy consumption when they are at rest. Artificial neural networks, by contrast, perform large-scale computations even when there is very little information to process, using far more energy than the brain requires for similar tasks.

Using this insight, Wang and his team developed a more efficient spiking neural network (SNN) framework. In time-series prediction tasks, SNNs now perform comparably to traditional neural networks but can theoretically reduce energy consumption to a quarter of the latter, offering a new path for low-power AI (Figure 1).
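The spiking mechanism described above can be sketched with a toy leaky integrate-and-fire neuron; the parameters are illustrative and not drawn from the team's SNN framework:

```python
# Toy leaky integrate-and-fire (LIF) neuron: the membrane potential decays
# over time, and the neuron fires only when input accumulates past a
# threshold, staying silent (and energy-free) otherwise.

def lif(inputs, threshold=1.0, leak=0.9):
    v, spikes = 0.0, []
    for x in inputs:
        v = leak * v + x          # integrate input with leaky decay
        if v >= threshold:
            spikes.append(1)      # fire on reaching the threshold
            v = 0.0               # reset membrane potential after a spike
        else:
            spikes.append(0)
    return spikes

# Mostly-quiet input produces mostly-quiet output: only two spikes here.
print(lif([0.1, 0.1, 0.9, 0.1, 1.2, 0.0]))
```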

Figure 1. The SNN framework and workflow for time-series prediction

“The spiking neural network research is just one part of our work,” says Wang. “Neurons in the brain are sparsely connected—each one typically links to only a few nearby neurons, while in artificial neural networks, a single neuron connects to thousands of others. The brain’s sparse connectivity also helps reduce energy consumption. If we continue learning from how the brain operates, AI will be able to generalize better and become more energy-efficient.”

Yansen Wang gives a lecture at the TEDxBeijing Innovation Conference

Breaking boundaries: From outsider to domain expert

For Wang, cross-disciplinary research in AI and neuroscience was new territory. He had no formal training in neuroscience or EEG analysis, but through dedicated study, active collaboration, and strong team support, he developed deep expertise.

When he began EEG research, Wang studied medical textbooks and sought guidance from collaborating physicians. “I set a rule: temporarily set aside my AI perspective and approach this as a doctor would. I studied medical textbooks thoroughly and learned to read EEG signals. Only then did I consider how AI could help.” This approach helped him understand clinical problems rather than imposing familiar frameworks.

Cross-disciplinary collaboration is not just about combining knowledge; it’s about how different perspectives collide. For example, in research on epilepsy detection, Wang discovered that AI researchers and physicians approach problems differently. AI researchers often assume that with enough data, models can learn to identify features of epileptic seizures, such as spikes and slow waves. But physicians, drawing on experience, can quickly spot rare abnormalities in massive amounts of hard-to-interpret EEG signals. Models can miss these abnormalities, even when trained on vast amounts of data.

“This showed me that machine learning progress cannot rely solely on brute force. In data-scarce fields like medicine, we must incorporate domain expertise and build in the right assumptions to improve model performance.”

To help researchers build cross-domain knowledge, Microsoft Research Asia – Shanghai established a neuroscience study group, with weekly classes, homework, and discussions. After six months, Wang had learned the fundamentals of neuroscience and gained practical guidance from senior researchers. “This collective learning atmosphere means we’re no longer working in isolation but instead growing together as a community,” he says.

Microsoft Research Asia encourages open exploration and open exchange. At the Shanghai lab’s weekly “Grand Challenge” meetings, researchers rigorously challenge one another’s work. “At first, I wasn’t used to this style of questioning,” Wang admits. “But I realized that these challenges expose blind spots and allow research to improve through iterative refinement. The toughest questions often lead to the most important breakthroughs.”

Yansen Wang (third from right) discusses research questions with colleagues.

Research with purpose: Building AI that serves people

For Wang, technology should serve people. Whether developing brain-computer interfaces or creating explainable AI for Go, the focus of the work should be on making AI useful and accessible.

In 2022, Wang and his colleagues launched what they called a “human salvation project” for Go players. AI had surpassed top players, causing anxiety among professionals. Players could imitate AI moves but couldn’t understand the reasoning behind them; they memorized patterns without developing their own strategic thinking. “I thought, ‘If AI could explain its logic, players could truly understand the strategies behind the moves,’” Wang says. “We wanted to help people improve alongside AI.” Wang and the team are actively collaborating with Go enthusiasts and professional players to verify the feasibility of the explanations. “That is so impressive!” says a teacher at a Go learning institute. “I see the future of teaching humans how to play Go.”

For Wang, this captures what drives his research: not the number of papers, but tangible impact. Perhaps it’s the moment when a player grasps a brilliant move, or when someone finds more convenient ways to interact with devices, or when researchers apply new approaches for energy-efficient AI.

“At Microsoft Research Asia, I can follow my interests and work with partners to solve meaningful problems for humanity,” he says.

The post Where AI meets neuroscience: Yansen Wang’s pursuit of human-centered innovation appeared first on Microsoft Research.

UI-E2I-Synth: Realistic and challenging UI grounding benchmark for computer-use agents http://approjects.co.za/?big=en-us/research/articles/ui-e2i-synth-realistic-and-challenging-ui-grounding-benchmark-for-computer-use-agents/ Mon, 24 Nov 2025 04:15:07 +0000

AI assistants, designed to perform actions on behalf of users, may not be as capable as current benchmarks suggest. New research reveals that existing tests for UI grounding—the ability of assistants to locate elements in the graphical user interface (GUI)—have been overestimating the performance of visual language models (VLMs), which power these assistants.

This becomes clear when comparing test conditions to real-world use. Current benchmarks use unrealistically large GUI elements—on a typical monitor, buttons and icons occupy a much smaller fraction of the screen—and test only a limited subset of element types like checkboxes and Submit buttons. Moreover, these benchmarks rely on simple, explicit instructions like “click the Save button” while neglecting the implicit language people actually use, like “click where I can change my password.”

To address this gap, a research team at Microsoft Research Asia has developed UI-E2I-Synth (opens in new tab), a large-scale data synthesis method that generates more realistic data, and UI-I2E-Bench (opens in new tab), a benchmark that better reflects actual computer use. The related paper has been accepted at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025).

Specifically, UI-I2E-Bench reflects typical 1080p and 1440p displays, labels whether each instruction is explicit or implicit, and includes a broader, more balanced range of UI elements compared with leading benchmarks, as shown in Figure 1.

Figure 1: Left: ScreenSpot, a widely used benchmark for testing GUI grounding in multimodal models, shows interface elements that are disproportionally large compared to real-world desktops. Right: The ScreenSpot dataset mainly consists of text and icon elements, offering limited variety in other types of GUI components.

How the system generates training data

UI-E2I-Synth uses large language models (LLMs) to automatically generate realistic user instructions, reducing the manual effort required to label screenshots. The system works in three stages, each building on the previous one.

Gathering and organizing interface data. First, it collects UI screenshots and accompanying metadata from various platforms, including web pages, Windows applications, and Android interfaces. An automated tool then identifies and catalogs each UI element, recording details like whether it’s a button or text field, what it displays, and where it appears on screen. This step produces an organized catalog of UI elements that serves as the foundation for generating instructions.

Generating natural descriptions. Next, OpenAI’s GPT-4o analyzes these cataloged elements to create different ways users might realistically describe them, both explicit descriptions (e.g., “the blue Submit button in the top-right corner”) and implicit ones (e.g., “the confirmation button” or “the button next to the username”). This variety captures the range of ways users might refer to the same interface element.

Creating complete instructions. Finally, GPT-4o pairs these descriptions with specific actions to create complete, natural-sounding user instructions that reflect how people actually interact with interfaces, for example, “Click Send” or “Enter my password.” The result is a diverse set of instructions that more accurately reflects user behavior.

This process is illustrated in Figure 2.

Figure 2: UI-E2I-Synth’s three-stage process for generating realistic user instructions.
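The element catalog produced in the first stage might be modeled like this (field names are hypothetical, chosen only to mirror the description above); the second and third stages would feed such records to GPT-4o for description and instruction generation:

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    kind: str        # e.g. "button", "text_field", "checkbox"
    text: str        # visible label, if any
    bbox: tuple      # (x, y, width, height) in pixels

def element_to_prompt(el):
    # Serialize one catalog entry into prompt context for stage two.
    x, y, w, h = el.bbox
    return f"{el.kind} '{el.text}' at ({x},{y}), size {w}x{h}"

el = UIElement(kind="button", text="Submit", bbox=(1820, 40, 80, 28))
print(element_to_prompt(el))  # button 'Submit' at (1820,40), size 80x28
```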

Training and testing more realistic models

From UI-E2I-Synth’s synthesized data, the team created UI-I2E-Bench, a new benchmark that reflects real-world conditions. It includes labels identifying the type of element and whether instructions are explicit or implicit, along with a realistic element-to-screen ratio—providing a rigorous test of vision-language models’ (VLMs) GUI grounding capabilities.

To evaluate the effectiveness of the proposed data synthesis pipeline, the team used almost 10 million individual instructions generated by UI-E2I-Synth to train two VLMs: UI-I2E-VLM-4B and UI-I2E-VLM-7B. UI-I2E-VLM-7B performed well across multiple benchmarks, including UI-I2E-Bench, using only 72% of the training data required by comparable state-of-the-art models.

The models performed especially well at handling indirect instructions and locating small elements, and more easily recognized challenging element types like icons and text entry fields. The results also confirmed that existing benchmarks overestimate model capabilities due to their unrealistically simple test conditions. The details of these results are shown in Table 1.

Table 1. Performance analysis by category of UI-I2E-VLM models on GUI grounding benchmarks with accuracy values given in percent. UI-I2E-VLM-7B achieved the best performance on the majority of tasks.

Diagnosing model strengths and weaknesses

The detailed labels of UI-I2E-Bench enabled the research team to analyze where models succeed and fail. The analysis revealed several key patterns.

Instruction complexity. Models showed the most improvement in handling implicit instructions. As shown in Table 2, leading models struggled with these realistic instructions, lagging by 12 percentage points compared with their accuracy on explicit instructions. Interestingly, systems powered by GPT-4o performed well on implicit instructions but struggled with explicit ones, primarily due to difficulty in locating small elements and uncommon interface components.

Element size matters. The smaller the interface element, the more accuracy dropped across all models. This confirms that small elements and high-resolution images are critical factors in model testing. Models trained with UI-E2I-Synth, which uses more training data and processes images with higher detail, performed better in locating these small elements.

Underrepresented element types. Existing models showed clear shortcomings with less common interface elements like icons and text entry fields. By balancing the distribution of element types in training data, UI-E2I-Synth directly addresses this gap and improves model performance.

Table 2. Detailed performance analysis by category on UI-I2E-Bench, with accuracy values given in percent. UI-I2E-VLM-7B achieved the best performance on all tasks.

Raising the bar for UI grounding

UI-E2I-Synth and UI-I2E-Bench address fundamental gaps in how GUI grounding models are trained and evaluated. Rather than relying on oversimplified benchmarks, this approach prepares models for the messy reality of actual computer interfaces—where elements are small, diverse, and instructions are often ambiguous.

The research establishes more rigorous standards for the field and could pave the way for AI assistants that can reliably navigate real-world software, moving these tools closer to practical deployment.

The post UI-E2I-Synth: Realistic and challenging UI grounding benchmark for computer-use agents appeared first on Microsoft Research.

]]>
UI-Evol: Computer-use Agents Act on Knowledge http://approjects.co.za/?big=en-us/research/articles/ui-evol-compute-use-agents-act-on-knowledge/ Mon, 17 Nov 2025 04:21:04 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1155826 Computer-use agents are AI systems that autonomously navigate and interact with software applications through graphical user interfaces (GUIs), and they are emerging as a new capability in artificial intelligence. By navigating and manipulating the same visual interfaces that people use, they can perform complex tasks on behalf of users, from filling out forms to managing […]

The post UI-Evol: Computer-use Agents Act on Knowledge appeared first on Microsoft Research.

]]>
Computer-use agents are AI systems that autonomously navigate and interact with software applications through graphical user interfaces (GUIs), and they are emerging as a new capability in artificial intelligence. By navigating and manipulating the same visual interfaces that people use, they can perform complex tasks on behalf of users, from filling out forms to managing workflows.

Yet despite their promise, these agents perform poorly in practice. They typically draw on external knowledge—information retrieved from the web that describes how to navigate the interfaces in question—and use it to interpret what’s on the screen and adapt to different environments. However, these agents often fail to translate this knowledge into successful action—a problem researchers call the “knowledge–action gap.”

A recent study shows that even when the instructions are 90% correct, agents perform tasks successfully only 41% of the time. This disconnect between having the needed information and effectively applying it, illustrated at the top of Figure 1, can lead to a frustrating user experience.

To address this, researchers at Microsoft Research Asia developed UI-Evol, a ready-to-use component that integrates into an agent’s workflow and relies on the actual user interface for guidance. UI-Evol continuously updates its interface knowledge, helping make agents more accurate and reliable when completing tasks, as shown in the bottom of Figure 1.

Figure 1: The top shows how correct external knowledge still fails to work in real-world settings. The bottom shows how UI-Evol narrows this gap by aligning knowledge with the software environment, enabling more reliable performance.

This work has been recognized by the research community, with the team’s findings accepted at the ICML 2025 Workshop on Computer Use Agents (opens in new tab).

How UI-Evol works

UI-Evol addresses the knowledge-action gap through a two-stage process. The first stage, called retrace, records the exact steps an agent takes to finish a task. In this way, the system captures the specific clicks, keystrokes, and other actions that led to the result.

The second stage, critique, reviews those actions against instructions drawn from outside the application. If it finds mismatches, it adjusts the knowledge so that the steps reflect what actually works in practice. Together, these two stages turn external instructions into tested, reliable guidance for agents. This process is illustrated in Figure 2.

Figure 2: UI-Evol’s two stages refine outside instructions with the agent’s real actions, producing guidance that works in practice.
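
The retrace/critique loop described above can be sketched in a few lines. This is an illustrative simplification, not the actual UI-Evol implementation; the `retrace` and `critique` functions and the dict-based action log are hypothetical:

```python
# Illustrative sketch of UI-Evol's two stages (hypothetical API, not the real code).

def retrace(action_log):
    """Stage 1 (retrace): record the exact steps the agent actually took."""
    return [f"{a['type']}({a['target']})" for a in action_log]

def critique(external_steps, observed_steps):
    """Stage 2 (critique): review external instructions against observed
    actions, keeping steps that match and replacing ones that do not."""
    revised = []
    for ext, obs in zip(external_steps, observed_steps):
        revised.append(ext if ext == obs else obs)  # trust what actually worked
    revised.extend(observed_steps[len(external_steps):])  # keep extra real steps
    return revised

log = [{"type": "click", "target": "File"},
       {"type": "click", "target": "Export as PDF"}]
knowledge = ["click(File)", "click(Save as PDF)"]  # slightly outdated web guidance
print(critique(knowledge, retrace(log)))
# ['click(File)', 'click(Export as PDF)']
```

Over repeated tasks, a loop like this replaces untested web-sourced steps with steps the agent has verified in its own environment.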

Assessing UI-Evol’s effect on performance, reliability

The research team tested UI-Evol on Agent S2, a state-of-the-art computer-use agent. They used the OSWorld benchmark, designed to evaluate multimodal agents on open-ended computer tasks involving real software and workflows. They found that UI-Evol not only improved performance but also made the agent’s behavior more dependable.

Computer-use agents have long shown what researchers call “high behavioral standard deviation.” In plain terms, the same agent, given the same task, may act differently each time it tries to carry it out. This unpredictability has not been a central focus of earlier work, yet it is precisely what limits agents’ usefulness in real-world applications.

With UI-Evol, that pattern shifted. Experiments with agents based on leading LLMs, such as GPT-4o and OpenAI-o3, showed not only higher success rates (Table 1) but also greater consistency.

Table 1: Experiment results on OSWorld. “SR” denotes success rate. It shows that computer-use agents often behaved unpredictably. With UI-Evol, performance improved, and their behavior became more consistent.

What this means for practical AI

The introduction of UI-Evol tackles a problem that has challenged computer-use agents since their inception: the gap between what they know and what they can reliably do. As these agents move from research labs to real-world settings such as office automation, virtual assistants, and robotic process automation, consistency matters as much as capability.

UI-Evol’s approach—learning from actual agent behavior rather than relying on external knowledge alone—offers a path forward. It’s not only about making agents smarter; it’s about making them dependable enough to trust with real work.

The post UI-Evol: Computer-use Agents Act on Knowledge appeared first on Microsoft Research.

]]>
DocReward: Advancing professional document design through AI evaluation http://approjects.co.za/?big=en-us/research/articles/docreward-advancing-professional-document-design-through-ai-evaluation/ Thu, 13 Nov 2025 04:00:14 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1155599 In recent years, as the shift toward agentic AI has accelerated, automation has advanced to handle increasingly complex tasks, from document and code generation to image creation, visual understanding, and mathematical reasoning. This trend points to the growing need to transform traditional software into intelligent agents. When core productivity platforms like Microsoft Office evolve into […]

The post DocReward: Advancing professional document design through AI evaluation appeared first on Microsoft Research.

]]>
In recent years, as the shift toward agentic AI has accelerated, automation has advanced to handle increasingly complex tasks, from document and code generation to image creation, visual understanding, and mathematical reasoning. This trend points to the growing need to transform traditional software into intelligent agents. When core productivity platforms like Microsoft Office evolve into next-generation agents with autonomous reasoning and operational abilities, they can connect natural language and office automation in new ways, making work more efficient and precise.

One of the key challenges in this transformation lies in generating documents that are not only accurate in content but also well-structured and visually coherent. While most research has focused on improving text quality, the structural and stylistic dimensions of professional documents—layout, hierarchy, and readability—remain underexplored.

To address this gap, Microsoft Research Asia, in collaboration with The Chinese University of Hong Kong and the University of Chinese Academy of Sciences, has developed DocReward, a reward model that evaluates the structural and stylistic quality of AI-generated documents. By guiding agents to produce outputs that are clear, organized, and well-presented, DocReward provides crucial support for automated document creation.

Deep Research is an agent that gathers, analyzes, and synthesizes information from multiple sources into coherent, well-structured documents. Combined with DocReward, it completes a full workflow—from research and information integration to polished document presentation—laying the groundwork for transforming traditional office software into agent-driven systems.

Figure 1. DocReward automatically evaluates a document’s quality based on structure and style, supporting agentic workflows in generating polished documents.

Task modeling: Evaluating document structure and style

DocReward assigns quality scores to documents based on their layout and visual characteristics. For example, consider a set of documents {D_i}, where each document includes both its text content D_text,i and a corresponding rendered image D_img,i. The reward model assigns scores to these documents to reflect their quality in terms of structure and stylistic presentation.

For a group of documents with identical content, the goal is for the reward model (Rθ) to predict scores in an order consistent with the true ranking of their structural and stylistic quality (π*). In doing so, the model learns to distinguish differences in layout and design even when the text remains the same, improving the accuracy of its evaluations.

Mathematically, this is expressed as the following:

argsort_i R_θ(D_text,i, D_img,i) = π*

That is, sorting the documents by their predicted scores reproduces the true ranking π* of their structural and stylistic quality.
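
For intuition, the ordering objective can be checked numerically. The scores below are invented for illustration, and the `induced_ranking` helper is not part of DocReward:

```python
def induced_ranking(scores):
    """Indices of documents sorted from highest to lowest predicted score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

scores = [1.21, 2.11, 5.34]   # hypothetical reward-model outputs for three renderings
true_ranking = [2, 1, 0]      # ground-truth ranking pi*, best document first
print(induced_ranking(scores) == true_ranking)  # True
```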

Definition of document structure and style professionalism:

  • Structure: Proper use of whitespace and margins; clear section separation; consistent text alignment; paragraph spacing and indentation; standardized headers and footers; and overall logical organization.
  • Style: Appropriate fonts (type, size, and color); readable heading hierarchy; effective use of bold and italics for emphasis; clear bullet and numbering formats; and consistent formatting throughout the document.

Constructing DocPair, the foundation for DocReward

To train DocReward, the research team built the DocPair dataset, which contains 117,000 document pairs spanning 32 domains and 267 document types. This diverse dataset enables the model to be optimized through preference learning to accurately assess structural and stylistic quality across a wide range of documents.

As shown in Figure 2, constructing the DocPair dataset involves three steps:

1. Curating high-quality professional documents

The team began by collecting a broad set of Microsoft Word files, ranging from formal institutional documents to routine business correspondence. Data sources include:

  • Government and institutional documents, which make up the GovDocs1 and NapierOne datasets. GovDocs1 contains a wide range of U.S. government materials, including policy reports, administrative forms, statistical reports, meeting minutes, and more. NapierOne features office documents from public institutions, all characterized by strong structural and stylistic standards.
  • Web documents, which consist of professionally authored files from the CommonCrawl database, spanning business, education, the nonprofit sector, and medicine. These include proposals, syllabi, newsletters, technical manuals, and policy briefs, contributing to broad diversity in document formats and presentation styles.

To ensure data quality, all documents were converted to .docx format and filtered to remove abnormal or incorrectly formatted files. The large language model (LLM) GPT-5 was then used to automatically score structure and style on a 0–10 scale, retaining only those scoring above 8.
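
The retention step amounts to a simple threshold over the LLM-assigned scores. The document names and scores below are invented for illustration:

```python
def keep_professional(docs, threshold=8):
    """Keep only documents whose LLM-assigned structure/style score
    (0-10 scale) exceeds the threshold."""
    return [d for d in docs if d["score"] > threshold]

docs = [{"name": "policy_report.docx", "score": 9.2},
        {"name": "rough_memo.docx", "score": 6.5}]
print([d["name"] for d in keep_professional(docs)])  # ['policy_report.docx']
```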

The resulting dataset spans 32 domains and 267 document types and serves as the basis for subsequent document-pair construction. Figures 2 and 3 show the distribution of the top 10 domains and top 30 document types.

Figure 2. Top 10 document domains
Figure 3. Top 30 document types

2. Expanding source documents via agents

To create document sets with identical text but varying structure and style, the team designed two types of document-generation agents:

  • Text-to-document generation agent: Extracts plain text from source documents, removes all structural and stylistic information, and then uses advanced LLMs (GPT-4o, Claude Sonnet 4, GPT-5, etc.) to generate .docx documents through python-docx code.
  • Structure and style optimization agent: Further refines the synthetic documents by referencing original human-written examples. This process involves two stages—first generating an optimization plan, then modifying .docx files via python-docx to improve structure and style.

3. Document ranking and annotation

Within each document group, all samples share the same text content. The team constructed two types of comparison pairs:

  • Human vs. synthetic documents: When a pair includes a real human-written document, that version is labeled as more professional.
  • Synthetic vs. synthetic documents: When both documents are synthetic, a human-written reference document is used as a guide, and GPT-5 annotates which synthetic version exhibits higher structural and stylistic quality.
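
These two pairing rules can be sketched in a few lines. The field names (`id`, `source`, `quality`) are illustrative, not the dataset’s actual schema:

```python
def build_pairs(group):
    """Build (winner, loser) preference pairs within one same-text group.
    Each doc is a dict with 'source' ('human' or 'synthetic') and, for
    synthetic docs, an annotator-assigned 'quality' score."""
    pairs = []
    humans = [d for d in group if d["source"] == "human"]
    synthetics = [d for d in group if d["source"] == "synthetic"]
    # Rule 1: human-written documents are labeled more professional than synthetic ones.
    for h in humans:
        for s in synthetics:
            pairs.append((h["id"], s["id"]))
    # Rule 2: between synthetics, the annotator's quality judgment decides.
    for i, a in enumerate(synthetics):
        for b in synthetics[i + 1:]:
            if a["quality"] != b["quality"]:
                w, l = (a, b) if a["quality"] > b["quality"] else (b, a)
                pairs.append((w["id"], l["id"]))
    return pairs

group = [{"id": "h1", "source": "human"},
         {"id": "s1", "source": "synthetic", "quality": 7},
         {"id": "s2", "source": "synthetic", "quality": 4}]
print(build_pairs(group))  # [('h1', 's1'), ('h1', 's2'), ('s1', 's2')]
```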

The final DocPair dataset provides a solid foundation for training DocReward. Multi-page visual renderings of a document are fed to the vision encoder, and a regression head is added to the language model. A special token is placed at the end of each image sequence, and its hidden state, processed by the regression head, predicts the document’s overall score.

Figure 4 illustrates the overall DocPair data construction process, summarizing the three main stages described above.

Figure 4. DocPair data construction process

Training and evaluation

Training

DocReward is trained using the Bradley-Terry (BT) loss to learn from paired document preferences. Each document’s pages are input into the model, which outputs a score representing its structural and stylistic quality. The BT loss encourages DocReward to assign higher scores to preferred documents, helping it reliably distinguish differences in structure and style.

Mathematically, this is expressed as the following:

L(θ) = -E_(D_w, D_l) [ log σ( R_θ(D_w) - R_θ(D_l) ) ]

where D_w and D_l are the preferred and rejected documents in a pair, and σ is the sigmoid function.
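
A minimal numeric sketch of this loss in plain Python (`s_w` and `s_l` stand for the scores of the preferred and rejected documents; no training framework is assumed):

```python
import math

def bt_loss(s_w, s_l):
    """Bradley-Terry preference loss: -log sigmoid(s_w - s_l).
    It shrinks as the preferred document's score margin grows."""
    return -math.log(1.0 / (1.0 + math.exp(-(s_w - s_l))))

# With no margin the loss is log(2); a reversed preference is penalized heavily.
assert abs(bt_loss(0.0, 0.0) - math.log(2)) < 1e-9
assert bt_loss(5.34, 2.11) < bt_loss(2.11, 5.34)
```

Minimizing this quantity pushes the model to score the preferred document higher, which is exactly the behavior described above.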

Experiments and evaluation

The research team conducted a series of experiments to test DocReward’s effectiveness in evaluating document structural and stylistic quality.

Experiment 1: Preference accuracy evaluation

Researchers randomly sampled high-quality documents to build an evaluation dataset that included both human-written and synthetic documents generated by various LLMs, ensuring diversity in structure and style.

For each group of documents with identical text but differing structure and style, experienced Word users familiar with document design ranked them by quality. These rankings were then converted into 473 document-pair comparisons, with each pair annotated to indicate which document was superior.

As shown in Table 1, DocReward achieved significant improvements over strong baselines, including GPT-4o, Claude Sonnet 4, and GPT-5.

Table 1. Accuracy of different reward models in predicting human preferences for document structure and style.

DocReward-7B (with 7 billion parameters) achieved an overall human-preference accuracy of 89.22%, outperforming the best proprietary baseline, GPT-5 (69.77%), by 19.45 percentage points. Even in the more challenging synthetic-vs.-synthetic setting, DocReward-7B maintained 78.22% accuracy, compared with GPT-5’s 64.85%.

These results show that DocReward accurately recognizes differences in document structure and style that existing LLMs often overlook.

Experiment 2: Improving document generation with DocReward

To assess DocReward’s impact on real document-generation tasks, the team ran experiments in which AI agents produced multiple candidate documents from the same text. Different reward models were then used to select the best-structured and best-styled version as the final output.

Three reward strategies were compared: random selection, GPT-5 as the reward model, and DocReward as the reward model. Human evaluators assessed each final document for structure and style, recording win/lose/tie ratios.

As shown in Figure 5, random selection performed the worst (24.6% win rate); GPT-5 improved performance to 37.7%; and DocReward achieved a 60.8% win rate and only a 16.9% loss rate, significantly outperforming both baselines.

Figure 5. Comparison of reward model strategies for document generation

To visually demonstrate DocReward’s ability to assess structure and style, the team conducted sample analyses using documents with identical text but differing layouts, as shown in Figure 6.

Figure 6. DocReward captures differences in document structure and style quality
  • Sample (a): The document has poor whitespace allocation: last-name field spacing is too small, and first-name field spacing is too large. This results in an unbalanced layout. Additionally, the key fields (Faculty/Department, Country, Country Code) are misaligned, creating visual clutter. The score is 1.21.
  • Sample (b): The table-like layout of the document is more organized than that of Sample (a), but the heading font is too small, lacking a clear distinction from the body text and weakening the visual hierarchy. Additionally, the input fields lack borders, making information harder to interpret. The score is 2.11.
  • Sample (c): The document features a clear, standardized structure, with a larger heading font, balanced whitespace, a well-aligned layout, and strong readability. The score is 5.34.

These examples show that DocReward accurately distinguishes differences in structural and stylistic quality, with scores consistent with human evaluations. Together, the experiments and sample analyses confirm that DocReward reliably guides agents to produce documents that align with human expectations for accuracy and presentation quality, supporting the agentic transformation of core office software like Microsoft Office.

The post DocReward: Advancing professional document design through AI evaluation appeared first on Microsoft Research.

]]>
OPA-DPO: Efficiently minimizing hallucinations in large vision-language models http://approjects.co.za/?big=en-us/research/articles/opa-dpo-efficiently-minimizing-hallucinations-in-large-vision-language-models/ Mon, 27 Oct 2025 03:54:46 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1153391 Large vision-language models are improving at describing images, yet hallucinations still erode trust by introducing contradictions and fabricated details that limit practical applications. In response, Microsoft Research Asia has developed On-Policy Alignment DPO (OPA-DPO), a new algorithm that aligns expert feedback with the model’s own output distribution before training begins. This “on-policy” alignment slightly alters […]

The post OPA-DPO: Efficiently minimizing hallucinations in large vision-language models appeared first on Microsoft Research.

]]>
Large vision-language models are improving at describing images, yet hallucinations still erode trust by introducing contradictions and fabricated details that limit practical applications.

In response, Microsoft Research Asia has developed On-Policy Alignment DPO (OPA-DPO), a new algorithm that aligns expert feedback with the model’s own output distribution before training begins. This “on-policy” alignment slightly alters the model so that expert corrections are close to what the model would naturally produce. As a result, the model is more likely to learn from these expert demonstrations, rather than treating them as outliers to be ignored.

Until now, most attempts to curb hallucinations have involved retraining models with extra data or applying filters to clean up their answers afterwards. While these approaches can help, they’re computationally expensive and don’t address the root issue: how models learn to distinguish accurate from misleading responses.

Direct Preference Optimization (DPO) has recently emerged as a solution. It trains models to favor accurate responses by learning from pairs of good and bad examples. However, when DPO is applied to vision-language tasks, it’s often inadequate because the expert-corrected examples differ too much from what the model would naturally generate, preventing effective learning.

OPA-DPO addresses this by providing a simpler and more data-efficient way to reduce hallucinations while using less training data than previous methods. This work has been recognized with an oral presentation at CVPR 2025.

Limitations of current DPO methods

Previous approaches fall into three categories:

  1. Hallucination injection, which injects hallucinated fragments into standard responses. Preference pairs are then constructed by pairing standard responses with their corresponding hallucinated versions.
  2. Hallucination recognition, where models generate responses and humans or GPT-4/4v identify and correct hallucinations. Preference pairs are then constructed by pairing corrected responses with their original versions.
  3. Self-evolution, where models generate multiple responses and a hallucination-recognition model ranks them by severity. Preference pairs are constructed based on these ranking results.
Figure 1. Three categories of previous approaches

Among these, self-evolution tends to perform best, followed by recognition and then injection. However, all three approaches face limitations. Hallucination injection is weak because the fabricated content does not reflect the model’s own tendencies. Self-evolution is more effective but computationally costly. Recognition, while seemingly the most intuitive, underperforms in practice because expert-edited responses are often too different from the model’s natural outputs. Standard DPO struggles to learn from this “off-policy” data, leading to vanishing gradients and little improvement.

These challenges highlight the need for a method that can incorporate expert corrections while staying aligned with the model’s own output distribution.

OPA-DPO: Breaking convention, reshaping alignment strategy

To address these challenges, OPA-DPO introduces an on-policy alignment step before DPO training. With only 4.8k training samples, it achieves state-of-the-art performance, compared with the 16k samples required by previous leading methods.

Figure 2. OPA-DPO implementation method  

OPA-DPO aligns a model’s outputs with expert-preferred responses through a four-step process. First, it generates responses from the model using both the image and prompt. Next, expert feedback—such as that from GPT-4v—is used to finely edit these responses, correcting hallucinations while preserving accurate content.

The edited and ground-truth responses are then used to fine-tune the data-producing model via LoRA-SFT, resulting in what is referred to as the OPA model. Finally, DPO training is performed on the OPA model, incorporating language, image, and anchor preference pairs. Among these stages, the OPA step has the greatest impact on performance. This process is shown in Figure 3.

Figure 3. OPA-DPO achieves alignment in four steps
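
The four steps can be summarized as a pipeline sketch. The function below only traces the order of operations with string stubs; none of the names correspond to real training code:

```python
def opa_dpo_pipeline(model, images, prompts):
    """Trace OPA-DPO's four stages (stub implementation for illustration)."""
    trace = []

    # Step 1: sample responses from the current model for each image/prompt pair.
    responses = [f"response({p})" for p in prompts]
    trace.append("generate")

    # Step 2: expert feedback (e.g., GPT-4v) finely edits the responses,
    # correcting hallucinations while preserving accurate content.
    edited = [r + "+edited" for r in responses]
    trace.append("edit")

    # Step 3: LoRA-SFT on the edited and ground-truth responses yields
    # the "OPA model", pulling expert corrections on-policy.
    opa_model = model + "+lora_sft"
    trace.append("opa")

    # Step 4: DPO training on the OPA model with language, image,
    # and anchor preference pairs.
    final_model = opa_model + "+dpo"
    trace.append("dpo")
    return final_model, trace

final_model, trace = opa_dpo_pipeline("llava-1.5-7b", ["image"], ["describe the scene"])
print(trace)  # ['generate', 'edit', 'opa', 'dpo']
```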

Researchers compared various DPO-based algorithms fine-tuned on LLaVA-1.5-7B and 13B. With only 4.8k training samples, OPA-DPO achieves state-of-the-art performance on 50% of hallucination metrics for LLaVA-Instruct-1.5-7B. This improves to 70% for LLaVA-Instruct-1.5-13B. OPA-DPO demonstrates particularly strong results on metrics that directly measure hallucination occurrence, such as CHAIR and HalRate. The results are shown in Table 1.

Table 1. To fairly compare various RLAIF/RLHF-enhanced LVLM algorithms, researchers used a greedy-search algorithm to evaluate across multiple benchmarks, annotated sources to distinguish official reproductions from paper results, and bolded the best scores in each metric group.

Evaluating OPA-DPO

To validate the importance of OPA and data volume, researchers conducted ablation studies. Even with 600 training samples, OPA-DPO performs better than most baseline algorithms on hallucination-related metrics. As the data volume increases, the performance of OPA-DPO steadily improves. Incorporating the OPA operation leads to a nearly 50% reduction in AMBER HalRate and Object-halCHAIRs.

Figure 4. Impact of training data volume and OPA operation on OPA-DPO (ablation study)

They also experimented with LLaVA-OneVision as the base model. Despite its detailed but redundant outputs and numerous hallucinations, OPA-DPO significantly improved hallucination metrics with 2.4k training samples, achieving a 43.2% reduction in HalRate and a 38.7% improvement in CHAIR scores compared to baseline models.

Table 2. Experimental results of OPA-DPO on LLaVA-OneVision

OPA-DPO-trained models tend to adopt a conservative strategy, emphasizing salient and verifiable observations while minimizing attention to ambiguous or less relevant details. As illustrated in Figure 5, this approach focuses the description on the actions of the three individuals at the center of the image, while deliberately ignoring peripheral elements such as trees and minor details like backpacks that are speculated by base models. By avoiding speculative or overly detailed content that could introduce hallucinations, the models prioritize clarity and reliability—contributing to their improved performance on hallucination metrics.

Figure 5. Impact of OPA operation on model output in image description tasks

Interestingly, base models often assume the query language is accurate, even when it contains hallucinations, leading to responses that reinforce false premises. In contrast, OPA-DPO-trained models demonstrate the ability to detect and reject hallucinated content embedded in the query itself. As shown in Figure 6, this approach can identify fabricated elements—such as the mention of “hands” in the input prompt—and respond with clarifications or corrections rather than perpetuating the hallucination.

Figure 6. In erroneous premise inquiry tasks, models trained with OPA-DPO show the ability to identify hallucinations in the query.

OPA-DPO not only improves algorithm performance but also advances multimodal alignment methods. Its approach of generating on-policy data from expert feedback marks a step forward in multimodal alignment training.

The post OPA-DPO: Efficiently minimizing hallucinations in large vision-language models appeared first on Microsoft Research.

]]>
Microsoft study shows AI assistants help with development for programmers who are blind or have low vision http://approjects.co.za/?big=en-us/research/articles/microsoft-study-shows-ai-assistants-help-with-development-for-programmers-who-are-blind-or-have-low-vision/ Tue, 30 Sep 2025 01:30:29 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1150836 Developers who are blind or have low vision have historically been limited to back-end programming, but new research suggests AI programming assistants are changing that in remarkable ways. A Microsoft Research Asia study found that developers who use screen readers can now tackle previously challenging tasks like UI development through an AI-assisted software development technique […]

The post Microsoft study shows AI assistants help with development for programmers who are blind or have low vision appeared first on Microsoft Research.

]]>
Developers who are blind or have low vision have historically been limited to back-end programming, but new research suggests AI programming assistants are changing that in remarkable ways. A Microsoft Research Asia study found that developers who use screen readers can now tackle previously challenging tasks like UI development through an AI-assisted software development technique where natural language replaces traditional syntax, also known as vibe coding.

The implications extend far beyond accommodation. Only 1.7% (roughly 1,100 of 70,000) of surveyed developers (opens in new tab) are blind or have low vision. Yet the Microsoft Research study shows that AI assistants can unlock new capabilities for this group, sometimes surpassing traditional methods.

“I used to do only non-UI development because my visual impairment made UI tasks difficult,” said one blind developer. “Now, I turn user feedback into prompts for GitHub Copilot to modify code, ask it to check the generated code, and send screenshots for review. I can even review the code myself. This has greatly simplified my workflow.”

The research

The Microsoft Research Asia team recruited 16 developers with varying experience levels and degrees of visual impairment for a comprehensive three-phase study examining real-world use of GitHub Copilot in Visual Studio Code.

In the first phase, participants completed onboarding and coding tasks. Then they used Copilot in their daily work for two weeks while documenting their experience. Final interviews captured long-term feedback on participants’ performance and sentiment.

GitHub Copilot proved ideal for the study because it already incorporates accessibility features: sound cues in addition to visual prompts, text-based views for layout clarity, and multimodal capabilities that convert visual content like screenshots into textual descriptions. The tool’s features are illustrated in Figure 1.

Figure 1. GitHub Copilot feature overview, showing the core functions of code completion, inline chat, and a dedicated chat panel with three modes: Ask, Edit, and Agent.

Beyond basic accommodation

“Through real-world use, participants consistently reported that AI programming tools improved efficiency, enhanced coding skills, and lowered the barrier to learning new technologies,” said Luna Qiu, technical program manager at Microsoft Research Asia – Shanghai. “More importantly, these tools used the multimodal capabilities of large models to assist with visual elements, expanding users’ capabilities.”

The study revealed how participants were adapting the new vibe coding approach to overcome traditional limitations. One developer explained: “I like to discuss plans or ask for explanations in Ask mode before letting GitHub Copilot handle my files.” Another noted the power of natural language: “I used natural language to ask GitHub Copilot to undo an operation—and it worked.”

But the benefits went beyond simple task completion. “Accessibility isn’t just about adding labels or shortcuts,” said another developer who is blind. “More types of cues, like sound effects, help me better understand changes. Too many text prompts can actually interfere with my code comprehension.”

For newcomers to programming, the impact was particularly striking. “With a code assistant like GitHub Copilot, getting started with programming is much easier,” one participant noted. “In daily life, we have all kinds of needs, and better programming capabilities help us meet personalized requirements.”

Video 1 shows how screen readers enable users to review code four times faster than they normally would.

Video 1

Video 2 shows the actual GitHub Copilot interface.

Video 2

Eight critical improvements

The research team identified specific pain points and solutions across four key areas of AI programming tools.

Managing AI interactions

More consistent shortcuts and clearer feedback: Users often run into conflicting keyboard shortcuts that don’t behave consistently across sessions. Because of this, some resort to clumsy workarounds like copying content to the clipboard and pasting it elsewhere for editing. We recommend creating a consistent and predictable shortcut system that minimizes conflicts, reduces extra navigation, and provides timely, accessible session settings.

Guidance on prompts and model choice: AI suggestions are sometimes too brief or based on incorrect assumptions, which requires users to repeatedly refine prompts. As users gain experience, tools should help by detecting vague prompts, asking for clarification, and offering straightforward guidance on selecting suitable AI models for the task.

Reviewing AI responses

Clearer responses: For developers using screen readers, audio cues can be unclear or distracting, and intermixed code changes are difficult to follow. We recommend a system that tracks changes through clear sound cues or text indicators, provides concise text summaries, and groups related information to reduce navigation effort and cognitive load.

Smarter message navigation: Lists of messages can help organize interactions, but navigation is often linear and inefficient. Long responses and input fields that are hard to exit add to the difficulty. We recommend a more navigable format that groups related messages, uses headings or indexes for orientation, minimizes misleading content, and provides reference information to build trust.

Accessible view, optimized: A plain-text accessibility view simplifies navigation but often loses important detail, especially in formatted content like tables. A simplified UI is valuable, but it should still preserve the completeness and integrity of information.

AI response playback: Automatic playback of AI responses can reduce manual effort, but long passages can interrupt thought flow and be hard to digest. We recommend making this “autoplay” optional so that users can choose their preferred interaction style.

Staying focused across views

Improving focus with integrated views: Switching between the editor, chat panel, and terminal can break concentration and increase the risk of errors. In Agent mode, developers must divide attention across multiple views, which makes this even harder. We recommend consolidating key information and actions into a single panel, along with self-verification tools and clear feedback to reduce the need for manual cross-checking.

System status and next steps

Clear status updates: After submitting a request, users need timely updates to understand system status. In Agent mode, vague notifications make it harder to decide on next steps. We recommend providing clear status updates that separate AI-driven actions from those requiring user input, and adding a “Do Not Disturb” setting to minimize unnecessary interruptions.

“AI programming tools are expanding in functionality, but for users of screen readers, more features don’t mean better usability,” said Nan Chen, research SDE at Microsoft Research Asia – Shanghai. “Complex interfaces, convoluted workflows, and unpredictable feedback reduce efficiency. What’s needed is to deliver more value through fewer actions. Striking the right balance between added features and streamlined usability will be a key challenge for future accessibility design.”

Looking ahead to personalized AI programming

As tools evolve from passive adaptation to active customization, personalization is emerging as a new direction for accessible programming. Users of screen readers have diverse preferences: some want minimal text for quick access to information, while others need richer detail to understand code logic and structure.

“With the learning and adaptation capabilities of large models, AI programming tools can tailor interactions to each user’s traits and habits, becoming a truly personalized assistant,” said Luna Qiu.

These new interaction models and workflows expand the potential of human-AI collaboration and highlight opportunities to improve accessibility. Based on these insights, the research team proposed specific recommendations for more accessible programming.

For example, accessibility design should be built in from the start, not added as a post-launch patch. When screen reader use cases are considered early in the process, accessibility is embedded throughout the product.

Regarding developer support, the focus should go beyond documentation that relies heavily on visuals like screenshots or diagrams. Creating learning materials designed specifically for users of screen readers can lower barriers, improve efficiency, and help more people master AI programming tools, allowing them to participate more fully in the shift toward AI-assisted development.

The post Microsoft study shows AI assistants help with development for programmers who are blind or have low vision appeared first on Microsoft Research.

StreamMind: AI system that responds to video in real time http://approjects.co.za/?big=en-us/research/articles/streammind-ai-system-that-responds-to-video-in-real-time/ Fri, 15 Aug 2025 03:00:46 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1147987 Imagine a pair of smart glasses that detects its surroundings and speaks up at critical moments, such as when a car is approaching. That kind of split-second assistance could be transformative for people with low vision, but today’s visual AI assistants often miss those moments. The problem isn’t that the technology can’t detect its environment. […]

The post StreamMind: AI system that responds to video in real time appeared first on Microsoft Research.

Imagine a pair of smart glasses that detects its surroundings and speaks up at critical moments, such as when a car is approaching. That kind of split-second assistance could be transformative for people with low vision, but today’s visual AI assistants often miss those moments.

The problem isn’t that the technology can’t detect its environment. It’s that current AI systems get bogged down trying to analyze every single frame of video, dozens per second, slowing themselves down in the process. By the time they recognize what’s happening, the moment for helpful intervention has passed.

Now, researchers from Microsoft Research Asia and Nanjing University have designed a system aimed at overcoming this limitation. Their model, called StreamMind, processes video more like a human brain, skimming over uneventful moments and focusing only when something important occurs. The result is video processing that’s up to ten times faster, quick enough to respond as events unfold.

A brain-inspired approach

The key insight is surprisingly simple: instead of analyzing every frame, StreamMind uses an event-gated network that separates fast perception from deeper analysis (Figure 1).

A lightweight system continuously scans video for changes. Only when something meaningful occurs, like a car entering a crosswalk, does it trigger a more powerful large language model (LLM). This decoupling lets the perception module run at video speed, while the cognition module, the LLM, activates only when needed. By removing unneeded computation, StreamMind can keep pace with the video stream, maintaining real-time awareness of its environment.

Figure 1. Traditional streaming video framework (left) versus StreamMind’s event-gated, decoupled perception and cognition modules (right).
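As a rough sketch (a toy illustration, not the actual StreamMind implementation), the decoupled loop looks something like this: a cheap per-frame update maintains a running summary of the stream, and the expensive cognition step fires only when the gate detects a large deviation from that summary.

```python
# Toy event-gated loop in the spirit of StreamMind (illustrative names and
# logic; the real system uses learned perception, gating, and an LLM).

def perceive(frame, prev_token, alpha=0.9):
    """Cheap running summary of the stream (stand-in for the perception module)."""
    return alpha * prev_token + (1 - alpha) * frame

def gate(token, frame, threshold=5.0):
    """Fire only when the new frame deviates strongly from the summary."""
    return abs(frame - token) > threshold

def cognition(frame):
    """Stand-in for the heavy LLM call; invoked only on gated events."""
    return f"event at value {frame}"

def stream_loop(frames):
    token, responses = 0.0, []
    for frame in frames:
        if gate(token, frame):          # rare: trigger the heavy model
            responses.append(cognition(frame))
        token = perceive(frame, token)  # always: cheap per-frame update
    return responses

# Mostly-uneventful stream with one sudden change:
print(stream_loop([0.1, 0.2, 0.1, 9.0, 0.2, 0.1]))  # → ['event at value 9.0']
```

In the toy run above, the heavy `cognition` call fires once, on the spike at 9.0, while the other five frames cost only a multiply-add each; the real system replaces the scalar summary with EPFE's learned perception token and the threshold test with a learned gate.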

Demonstrations: StreamMind in action

In demonstrations, StreamMind provided responses that matched the timing of events, while current methods lagged. It kept pace with a soccer match, providing smooth play‑by‑play commentary, and guided a cook through a recipe step by step.

Video 1. Navigation assistance: When compared with current methods, StreamMind responds as events occur, while other methods react noticeably later.
Video 2. Sports commentary: In a live soccer match, it keeps up with the flow of play and delivers timely narration.
Video 3. Cooking guidance: In a kitchen setting, the model provides instructions step-by-step, keeping pace with the action.

How the technology works

StreamMind combines two key innovations to enable real-time video perception and response:

Smart memory system

The Event Perception Feature Extractor (EPFE) addresses the biggest bottleneck in current video AI models: how to handle incoming frames in real time without getting overwhelmed. It uses a state‑space model—a method for tracking how data streams (such as video, audio, or sensor inputs) change over time—to extract patterns from long, continuous input. This allows the EPFE to remember key events using just one compact piece of information, called a perception token, and enables the system to efficiently keep pace with the video stream.
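In spirit, a state-space model maintains a hidden state that compresses the history of the stream, updating it with each new input and emitting a compact output. The scalar toy below (illustrative only; the real EPFE is learned and high-dimensional) shows the recurrence, with the output standing in for the perception token:

```python
# Toy scalar state-space recurrence: h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t.
# The state h carries a compressed history of the stream; y_t plays the role
# of the one-token summary passed downstream.

def ssm_step(h, x, a=0.8, b=0.2, c=1.0):
    """One recurrence step; returns (new_state, output)."""
    h = a * h + b * x
    return h, c * h

def summarize(stream):
    h, tokens = 0.0, []
    for x in stream:
        h, y = ssm_step(h, x)
        tokens.append(round(y, 4))
    return tokens

print(summarize([1.0, 1.0, 1.0, 0.0]))  # → [0.2, 0.36, 0.488, 0.3904]
```

Because each step touches only the fixed-size state rather than all past frames, the cost per frame is constant, which is what lets the perception module keep pace with the video stream.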

Intelligent decision making

The second component determines whether what’s occurring in the video is relevant to the user’s request and whether the assistant should respond. This is a challenge because often there’s no direct connection between a user’s request and individual video frames. For example, a request like “help me fix my bike” requires understanding when to jump in with assistance.

To make those judgments, StreamMind draws on knowledge from an LLM to recognize when events are relevant and a response is needed. A small gating network, combined with a compact one-token summary of the video input, allows StreamMind to monitor events in real time and autonomously call on the LLM when it is time to act.

Figure 2: StreamMind architecture. EPFE (blue) continuously extracts video features. The gating network (labeled “Cognition Gate” in red) decides whether to invoke the large model.

Testing shows major speed gains

When evaluated against existing methods, StreamMind’s processing speed surpassed all other systems at every tested video speed. Even for fast 100-fps gaming video streams, it kept up with every frame in real time, something no previous system could manage (Figure 3).

Figure 3. Frames per second (FPS): This chart shows the time it took for StreamMind as well as two popular video models to process one second of streaming video at different speeds (A100 GPU). StreamMind (the third bar in orange) achieves 100-fps processing speed.

The researchers tested StreamMind in a range of scenarios, including online video commentary, predicting what would happen next in a video, and recognizing complex tasks like changing a tire or cooking. They used large datasets such as Ego4D (3,670 hours of first-person video from 923 participants across 74 locations), SoccerNet (videos of 12 European soccer matches), and COIN (11,827 instructional videos across 12 different subjects). The following tables show the detailed results of these tests.

Table 1. Results from the Ego4D and SoccerNet experiments
Table 2. Ego4D LTA dataset experiments
Table 3. COIN dataset experiments

Across all tests comparing StreamMind’s timing alignment and language modeling capabilities to those of existing streaming dialogue models, StreamMind delivered the best results, demonstrating that it can handle complex, fast-changing, real-world scenarios.

From lab to real life

StreamMind’s event-driven design could make wearable AI systems more responsive, allowing smart glasses and similar devices to react to important events as they happen rather than after the fact. By focusing on the moments that matter, rather than every frame, it could make smart glasses and similar devices far more responsive—able to guide, warn, and assist in step with real-world events.

TimeCraft: A universal framework for time-series generation http://approjects.co.za/?big=en-us/research/articles/timecraft-a-universal-framework-for-time-series-generation/ Mon, 04 Aug 2025 09:31:57 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1146392 Time-series data—measurements collected at regular intervals, like stock prices or traffic flows—has become a key driver of intelligent decision-making systems across industries. From medical monitoring to financial risk control, identifying patterns in this data is essential to many important operations. At the same time, the creation of time-series data, or data synthesis, is gaining momentum […]

The post TimeCraft: A universal framework for time-series generation appeared first on Microsoft Research.

Time-series data—measurements collected at regular intervals, like stock prices or traffic flows—has become a key driver of intelligent decision-making systems across industries. From medical monitoring to financial risk control, identifying patterns in this data is essential to many important operations.

At the same time, the creation of time-series data, or data synthesis, is gaining momentum as organizations grapple with scarcity of real-world data, privacy protection, and the need to test a variety of different scenarios without exposing themselves to risk. AI-generated synthetic data simulates realistic patterns in a risk-free environment. It enables researchers to explore hypothetical scenarios and train models to make decisions in high-stakes contexts.

Yet many of these models fall short of what’s needed. To be truly practical, a generator of time-series data must adapt across different industries and data patterns, offer precise control over trends and volatility and produce data that is realistic and reliable enough to support accurate modeling and analysis.

Microsoft Research Asia developed TimeCraft (opens in new tab) to address this need. This open-source framework creates synthetic time-series data that can be used across different industries and scaled up for commercial applications. Users control data generation through simple written commands, and the system can adapt to different business needs, whether companies want to analyze existing patterns or create data for specific goals.

Three ways to guide generation

TimeCraft’s user interface is built for flexibility. Users can guide data generation through three distinct methods:

  • Few-shot adaptation: Users can upload a small set of unlabeled samples from the target domain. TimeCraft learns structural features from these samples and generates high-quality data, no retraining or labels required.
  • Natural language control: Users can describe their desired time series in plain language, such as “stable early on, followed by sharp fluctuations.” TimeCraft interprets the prompt and produces data accordingly.
  • Task model feedback: Users can integrate their models—like a disease predictor or market trend detector—into the data creation process. TimeCraft dynamically adjusts the output based on model feedback, optimizing the data for performance.

These methods can be used independently or together, allowing users to generate data that aligns with specific goals, scenarios, or operational needs.

Figure 1. Overview of the TimeCraft architecture

One model, many industries

TimeCraft works across multiple industries—where each type of time-series data follows distinct patterns—with a unified approach built around semantic prototypes. These are shared representations of time-series structures that serve as a universal vocabulary.

When users provide a few example time-series sequences from their specific industry, the Prototype Assignment Module (PAM) maps them to the prototype space, calculating optimal combinations. This industry-specific input guides the model to generate structurally aligned data, no labels or retraining needed.
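A minimal sketch of this kind of soft assignment (illustrative only; the actual PAM operates on learned embeddings and learned prototypes) weights each prototype by its similarity to the input series:

```python
# Toy prototype assignment: softmax weights over a prototype bank, scored by
# negative squared L2 distance. An input series is then represented as a soft
# combination of the shared prototypes.
import math

def soft_assign(series, prototypes):
    """Return softmax weights over prototypes by negative squared distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(series, p)) for p in prototypes]
    exps = [math.exp(-d) for d in dists]
    z = sum(exps)
    return [e / z for e in exps]

prototypes = [
    [0.0, 1.0, 2.0, 3.0],  # a "rising" prototype
    [3.0, 2.0, 1.0, 0.0],  # a "falling" prototype
]
weights = soft_assign([0.1, 1.1, 2.0, 2.9], prototypes)
print(weights[0] > 0.99)  # the rising prototype dominates → True
```

Because the prototype bank is shared across domains, a handful of unlabeled examples is enough to locate a new domain in this space, which is what makes label-free, retraining-free adaptation possible.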

The result is a system that can rapidly adapt to new scenarios in fields such as energy, healthcare, finance, and transportation, demonstrating strong structural transfer and generalization.

Text-controlled generation: One sentence guides the model

In many real-world scenarios, users know what kind of data they need but don’t have access to enough relevant examples. A typical request might be: “I want a time series that slowly rises for a few days, drops around day 10, and then fluctuates.” These types of needs often arise in fields like healthcare and finance, where designing and testing systems with realistic data is essential but data access is limited.

TimeCraft makes it possible to generate this kind of tailored data using plain language. Instead of relying on specialized tools or existing datasets, users can simply describe the pattern they’re looking for, and the system creates data that fits.

It does this using a collaborative training process involving multiple AI agents. It collects phrasing from real-world industry reports, fills in details using actual data statistics, and refines the wording until the descriptions match the data both clearly and accurately.

When a user submits a description, TimeCraft translates it into guidance for its generative model, enabling direct input, even from users without technical expertise. This makes the tool especially useful in situations where data is scarce or constantly changing. By bridging the user’s intent with the model’s capabilities, TimeCraft makes custom data generation as simple as writing a sentence. This process is illustrated in Figure 2.

Figure 2. TimeCraft text-to-time series module consists of: (top) a multi-agent system that creates pairs of plain-language descriptions and matching time-series data, and (bottom) a hybrid mechanism that turns user-written descriptions into synthetic time-series data.
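To make the idea concrete, here is a rule-based toy (not TimeCraft's learned pipeline) that maps shape keywords from a comma-separated prompt to segments of a synthetic series:

```python
# Toy text-to-series generator: each comma-separated clause of the prompt
# drives one segment of the output. Purely illustrative — TimeCraft uses a
# learned text encoder conditioning a generative model, not keyword rules.

def generate_from_prompt(prompt, length=12):
    series, level = [], 0.0
    segments = [s.strip() for s in prompt.lower().split(",")]
    per_seg = length // len(segments)
    for seg in segments:
        for _ in range(per_seg):
            if "rise" in seg:
                level += 1.0                                   # upward trend
            elif "drop" in seg:
                level -= 2.0                                   # sharp decline
            elif "fluctuat" in seg:
                level += 0.5 if len(series) % 2 == 0 else -0.5  # oscillation
            series.append(level)
    return series

s = generate_from_prompt("slowly rises, drops around the middle, then fluctuates")
print(s)  # rises to 4.0, falls to -4.0, then oscillates
```

The contract is the same as in the real system: a plain-language description in, a series with the described shape out — TimeCraft just learns the mapping instead of hard-coding it.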

Task-aware generation: Optimized for real-world impact

Most generation models focus on producing realistic data. TimeCraft goes a step further, generating data that improves performance of downstream applications—whether it’s detecting disease trends or modeling market behavior.

This is possible thanks to TimeCraft’s task-aware generation framework. Users can integrate their existing models directly into the data-creation process. The system then uses feedback from these models to guide the direction of data generation in real time, so the output isn’t just realistic, it’s useful.

At the core of this method is a technique called influence scoring, which estimates how each piece of generated data affects a model’s performance. TimeCraft uses these scores to guide the generation process, helping the system produce data with the greatest potential to improve results. This process is shown in Figure 3.

Figure 3. Influence scoring process within the TimeCraft framework
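The idea behind influence scoring can be illustrated with a deliberately simple stand-in (TimeCraft uses an efficient estimator rather than retraining; the mean predictor below is purely illustrative): score each candidate by the drop in validation error when it is added to training, then keep the top scorers.

```python
# Toy influence scoring: the influence of a candidate sample is the reduction
# in validation error after adding it to the training set of a trivial
# mean-predictor model.

def val_error(train, val):
    """MSE of a mean predictor fit on `train`, evaluated on `val`."""
    mean = sum(train) / len(train)
    return sum((v - mean) ** 2 for v in val) / len(val)

def influence(candidate, train, val):
    """How much does adding `candidate` reduce validation error?"""
    return val_error(train, val) - val_error(train + [candidate], val)

train, val = [0.0, 0.2], [1.0, 1.2]
candidates = [0.1, 1.1, 2.5]
scores = [influence(c, train, val) for c in candidates]
best = candidates[scores.index(max(scores))]
print(best)  # → 2.5, the sample that helps the validation set most
```

Note that the highest-influence sample is not the one closest to the existing training data: it is the one that pulls the model toward the patterns the downstream task actually needs, which is exactly the rare-but-important case described above.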

This approach is especially helpful in cases where certain patterns are rare or critically important. For instance, in medical diagnosis, TimeCraft can focus on generating a small set of patterns that meaningfully improve prediction accuracy.

By shifting the goal from simulating data to generating data that actively improves outcomes, TimeCraft turns synthetic data into a strategic tool.

Built for real-world use, now open source

TimeCraft was built for real-world applications. It accepts different types of input, adapts to complex use cases, and improves over time using feedback from the tasks it supports. Researchers at Microsoft Research Asia envision it as a comprehensive solution for industries where data is limited, expensive to collect, or sensitive to share—making data generation more targeted, useful, and scalable.

Now open source (opens in new tab), TimeCraft is available for developers, researchers, and business partners around the world to explore, test, and build on.

Related research:

  • Cross-domain generalization
  • Controllability
  • Task adaptability
  • General techniques
  • Financial applications
