Research Focus: Week of May 13, 2024

Published May 15, 2024

Share this page

Welcome to Research Focus, a series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft.

NEW RESEARCH

Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning

Large language models (LLMs) have shown remarkable performance in generating text similar to that created by people, proving to be a valuable asset across various applications. However, adapting these models to incorporate new, out-of-domain knowledge remains a challenge, particularly for facts and events that occur after the model’s training knowledge cutoff date.

In a recent paper: Injecting New Knowledge into Large Language Models via Supervised Fine-Tuning, researchers from Microsoft investigate the effectiveness of supervised fine-tuning (SFT) as a method for knowledge injection in LLMs, specifically focusing on recent sporting events. They compare different dataset generation strategies—token-based and fact-based scaling—to create training data that helps the model learn new information. Their experiments on GPT-4 demonstrate that while token-based scaling can lead to improvements in Q&A accuracy, it may not provide uniform coverage of new knowledge. Fact-based scaling, on the other hand, offers a more systematic approach to ensure even coverage across all facts. The researchers present a novel dataset generation process that leads to more effective knowledge ingestion through SFT, and results show considerable performance improvements in Q&A tasks related to out-of-domain knowledge.

Read the paper

NEW RESEARCH

A Reflection on Human-Notebook Experiences in the Era of AI

Computational notebooks provide an interactive way to work with data. They have been widely used by data professionals to write code, explore data, and generate visualizations, all in one document. Previous research has revealed unique pain points around the user experience in computational notebooks. However, as AI tools like ChatGPT or Copilot have emerged, it is unclear whether these pain points have been reduced or changed, or whether new pain points have arisen. Due to the fast pace of advances in AI technology, most of the development of new AI tools has been primarily driven by technology and not by user experience.

In a recent paper: A Reflection on Human-Notebook Experiences in the Era of AI, researchers from Microsoft summarize literature on how new AI technology has impacted human-notebook interaction and human-computer interaction (HCI) paradigms, new challenges and user behavior around using AI assistants, and recent research on AI assistants in computational notebook scenarios. They outline gaps in existing literature and suggest a future focus on improving macro human-notebook experiences throughout a user’s workflow, measuring and quantifying the value of AI systems, and establishing a set of standards and best practices for AI tools.

Read the paper

NEW RESEARCH

Jacdac: Service-Based Prototyping of Embedded Systems

The traditional approach to programming embedded systems is monolithic: firmware on a microcontroller contains both application code and the drivers needed to communicate with sensors and actuators, using low-level protocols such as I2C, SPI, and RS232. In comparison, software development for the cloud has moved to a service-based development and operation paradigm: a service provides a discrete unit of functionality that can be accessed remotely by an application, or other service, but is independently managed and updated.

In a recent paper: Jacdac: Service-Based Prototyping of Embedded Systems (opens in new tab), researchers from Microsoft propose, design, implement, and evaluate a service-based approach to prototyping embedded systems called Jacdac (opens in new tab). Jacdac defines a service specification language, designed especially for embedded systems, along with a host of specifications for a variety of sensors and actuators. With Jacdac, each sensor/actuator in a system is paired with a low-cost microcontroller that advertises the services that represent the functionality of the underlying hardware over an efficient and low-cost single-wire bus protocol. A separate microcontroller executes the user’s application program, which is a client of the Jacdac services on the bus.

Three Jacdac kits, comprising over twenty modules, have been produced by third-party manufacturers: KittenBot (opens in new tab) and Forward Education (opens in new tab).

Read the paper

NEW RESEARCH

PARIKSHA: A Scalable, Democratic, Transparent Evaluation Platform for Assessing Indic Large Language Models

Evaluation of multilingual LLMs is challenging due to a variety of factors – the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data, and the lack of local, cultural nuances in translated benchmarks. Hence, it is difficult to extensively evaluate LLMs in a multilingual setting, leading to lack of fair comparisons between models and difficulties in replicating the evaluation setup used by some models. Recently, several Indic (Indian language) LLMs have been created to help build more locally and culturally relevant LLMs.

In a recent paper: PARIKSHA: A Scalable, Democratic, Transparent Evaluation Platform for Assessing Indic Large Language Models, researchers from Microsoft present an evaluation framework, which is the first comprehensive evaluation of Indic LLMs using a combination of human and LLM-based evaluation. The researchers conduct a total of 90,000 human evaluations and 50,000 LLM-based evaluations of 29 models to present leaderboards for 10 Indic languages. Pariksha provides inclusive evaluation by engaging a community of workers that represent India’s large and diverse workforce and also serves as a research platform for improving the process of evaluation. For transparency on the process, the evaluation artifacts will be released. Conducting Pariksha at regular intervals, the researchers aim to enable models to improve over time with insights and artifacts from their evaluations.

Read the paper

NEW RESEARCH

Tinker, Tailor, Configure, Customize: The Articulation Work of Customizing AI Fairness Checklists

Many responsible AI resources, such as toolkits, playbooks, and checklists, have been developed to support AI practitioners in identifying, measuring, and mitigating potential fairness-related harms. These resources are often designed to be general purpose, in order to address a variety of use cases, domains, and deployment contexts. However, this can lead to decontextualization, where such resources lack the level of relevance or specificity needed to use them.

To understand how AI practitioners might contextualize one such resource, an AI fairness checklist, for their particular use cases, domains, and deployment contexts, researchers from Microsoft conducted a retrospective contextual inquiry with 13 AI practitioners from seven organizations. In a recent paper: Tinker, Tailor, Configure, Customize: The Articulation Work of Customizing AI Fairness Checklists, they identify how contextualizing this checklist introduces new forms of work for AI practitioners and other stakeholders, while opening up new sites for negotiation and contestation of values in AI. The researchers also identify how the contextualization process may help AI practitioners develop a shared language around AI fairness. They also identify dynamics related to ownership over this process that suggest larger issues of accountability in responsible AI work.

Read the paper

NEW RESEARCH

MS MARCO Web Search: A Large-scale Information-rich Web Dataset with Millions of Real Click Labels

LLMs are becoming indispensable tools for many creative and information related tasks, but they still come with limitations, including a tendency to fabricate content. State-of-the-art algorithms pair the LLM with an external, dynamically updated knowledge base to ground the LLM’s answers and provide up-to-date information. However, these techniques require large amounts of relevant, labeled training data that have not previously been publicly available.

In a recent paper: MS MARCO Web Search: A Large-scale Information-rich Web Dataset with Millions of Real Click Labels presented at the 2024 ACM Web Conference, researchers from Microsoft introduce a novel dataset that closely mimics real-world web document and query distribution. MS MARCO Web Search contains 10 million unique queries across 93 languages with millions of relevant labeled query-document pairs. It uses ClueWeb22’s 10 billion high-quality web pages as the document corpus and provides rich information for various kinds of downstream tasks.

This dataset unlocks several new research directions that previous datasets cannot well support, including generic end-to-end neural indexer models, generic embedding models, and next generation information access systems with LLMs. MS MARCO Web Search offers a retrieval benchmark with three web scale retrieval challenge tasks, each with automatic evaluation and leaderboard. These tasks demand innovation in both machine learning and information retrieval systems. The researchers intend for MS MARCO Web Search to lay the groundwork for future advancements in AI and systems research.

View dataset

Read the paper

VIDEO

AI Case Studies for Natural Science Research with Bonnie Kruft

Among the stunning changes and disruptions driven by AI, one of the most significant is the impact on scientific discovery. In her presentation at EmTech Digital 2024 (opens in new tab), Bonnie Kruft, partner deputy director at Microsoft Research AI for Science, outlined some examples of how generative AI enables groundbreaking research in the natural sciences. Recent breakthroughs aided by AI include small molecular inhibitors for treating infectious disease, the discovery of new materials for energy storage, and new drug development.

Catch a replay of the presentation, including a follow-up Q&A with the audience, and hear how researchers are reducing discovery times from years to months. The discussion explores safe and responsible AI practices, how large language models can work with science-based models, and what lies ahead for AI in science.

Watch the video