Research Forum Brief | September 2024

Fostering appropriate reliance on AI

Presented by Mihaela Vorvoreanu at Microsoft Research Forum, September 2024

Mihaela Vorvoreanu is a caucasian woman in her forties with chin-length reddish hair and glasses

“This is where I think it is our responsibility as people working in UX disciplines—as people researching UX and human-computer interaction—to really, really step up to the front and see how it is our moment to shine and to address this problem.”

Mihaela Vorvoreanu, Director UX Research and Responsible AI Education, Microsoft Aether

Transcript: Lightning Talk

Fostering appropriate reliance on AI

Mihaela Vorvoreanu, Director UX Research and Responsible AI Education, Microsoft Aether

Because of their probabilistic nature, all AI systems will make mistakes. One of the main challenges in human-AI interaction is to foster appropriate reliance on AI and empower users of AI systems to determine when to accept or not accept an AI system’s recommendation. Hear about the work we’re doing at Microsoft to foster appropriate reliance and help people accept AI outputs when they are correct and reject them when they are wrong.

Microsoft Research Forum, September 3, 2024

MIHAELA VORVOREANU: Hi, everyone. My name is Mihaela, or Mickey, Vorvoreanu. I lead UX Research and Responsible AI Education in Aether, Microsoft’s research and advisory body on AI ethics and effects in engineering and research. And in a previous life, I was a professor of UX design and research.

During the past few months, I’ve had the privilege of leading a cross-company team of researchers and product builders focused on fostering appropriate reliance on AI, specifically generative AI and our Copilot product. In this working group, we think of fostering appropriate reliance on AI as striking a balance between overreliance, where people accept AI outputs even when they are incorrect or incomplete, and under-reliance, where people don’t use or trust AI outputs even when they could be useful. And so across all of us, we have started looking into how we can foster research that leads to improvements in our own products.

My team started looking into the problem of overreliance on AI quite a while back. About two years ago, we released a first review of research literature about overreliance on AI. In that paper, we isolated antecedents, mechanisms, and consequences of overreliance on AI and a series of mitigations that showed promise in the research literature. However, as we know, many such mitigations can backfire, actually increasing overreliance rather than mitigating it.

More recently, we released a second synthesis of research literature. This one focused specifically on generative AI. We find that generative AI makes this tricky problem of overreliance even more difficult for several reasons, one of them being that it is so much more difficult to spot incorrect or incomplete AI outputs, especially when they are formulated so fluently and with such impressive grammar. In this paper, we also looked at some overreliance mitigations. Some of them have been mentioned in the literature before, such as cognitive forcing functions, and others are quite new, involving the use of generative AI to critique existing answers or to stimulate critical thinking in generative AI users.

As Eric Horvitz and Abby Sellen point out in a recent opinion piece, using generative AI places a high cognitive burden on regular people during everyday life. Such levels of attention and vigilance were previously expected only of highly trained professionals, such as airline pilots. And so in our group, we wonder how we might make the use of generative AI products a little bit easier so people can maximize the benefits and minimize the risks while not spending as much mental energy as an airplane pilot would.

In our internal research—and here I want to acknowledge my wonderful team members who have done all of this research—we have identified three possible directions. Each one of these is a problem/an opportunity. The first one is that most people, even advanced users of generative AI, don’t have useful mental models of how these technologies work. They mostly think of them as traditional web search, and that doesn’t always come in handy. This points to the opportunity of helping people form useful mental models through AI literacy. We can create AI literacy, not only through formal or informal education, but also through responsible communication in journalism and in marketing, and also during interaction with a product.

We could do a better job of teaching people about generative AI while they interact with generative AI products. This is where the guidelines for human-AI interaction from the HAX Toolkit—particularly Guidelines 1, 2, and 11—can really come into play. They emphasize how important it is to make clear to users not only the system’s capabilities but also its limitations, and to provide some explanation of how it works so that users can form mental models.

I also invite you to keep an eye on the HAX Toolkit because we have been adding new examples and content related to appropriate reliance specifically. This is one idea of how we could intervene at the user interaction layer to actually foster AI literacy and more useful mental models.

The second direction and the second research finding is that overall, people are not inclined to verify AI outputs. One of the most popular strategies used in most products to date is what I like to call the warning-sticker strategy, where we might show something like, “AI-generated content might be incorrect.” This is partially useful. People seem to have learned that.

However, this type of notice doesn’t mention that AI-generated content might also be incomplete. And so people might miss out altogether on the fact that important or useful information is not in the answer in the first place. That raises the opportunity of how we might get people’s attention, arouse that attention and vigilance just a little bit, so they know when it is time to check answers and when it is not, especially in more important or high-risk situations.

In the research that we highlight on the working team’s webpage, we show some papers that talk about communicating uncertainty verbally, via text output, or via highlights that might help users spot when it might be time to increase their alertness level and verify outputs more carefully.

Finally, the third direction is that the user experience of verifying generative AI outputs is rather difficult for many people. The primary UI paradigm that we use for this is to cite sources like we do in a research or a school paper. Now, this format in itself suggests a level of rigor and trustworthiness that AI-generated outputs might not actually have. Because of this signal, people might not be inclined to verify: what’s really more trustworthy than a research or a school paper?

This raises the opportunity of how we might make the relationship between AI-generated outputs and the information that they work with—their grounding data—more transparent. How might we make it easier to verify, to spot discrepancies, to spot incompleteness? We are also looking even further into how we might use LLMs to propose critiques of their own responses or, as we see in some research that we highlight on the webpage, to not just give people a response but stimulate them to engage in critical thinking, which could be a very different paradigm of interacting with generative AI and large language models in particular.

Throughout all this, what I would really like to highlight, and I do this with my co-authors in this piece that appeared as an opening article in ACM Interactions not very long ago, is really that this is a moment for UX disciplines to shine. As you can see, a lot of these mitigations, a lot of these techniques for fostering appropriate reliance, are UX interventions.

This is where I think it is our responsibility as people working in UX disciplines—as people researching UX and human-computer interaction—to really, really step up to the front and see how it is our moment to shine and to address this problem. That being said, I hope you stay in touch. I hope you follow our research, which we publish on the team’s webpage, and I hope you help us follow your research. So maybe together, we can work towards making progress on this very tricky but important problem. With that, I want to thank you so much for following this presentation. I look forward to working with you.

A generative model of biology for in-silico experimentation and discovery

Presented by Kevin Yang at Microsoft Research Forum, September 2024

headshot of Kevin Yang

“EvoDiff is a discrete diffusion model trained on evolutionary-scale protein sequence data. By evolutionary scale, we mean that we train on sequences taken from across many different organisms and that perform many different functions.”

Kevin Yang, Senior Researcher, Microsoft Research New England

Transcript: Lightning Talk

A generative model of biology for in-silico experimentation and discovery

Kevin Yang, Senior Researcher, Microsoft Research New England

This talk discusses how deep learning is enabling us to generate novel and useful biomolecules, allowing researchers and practitioners to better understand biology.

Microsoft Research Forum, September 3, 2024

KEVIN YANG: Hi. I’m Kevin K. Yang, senior researcher at Microsoft Research, and I’ll be presenting on generative models of biology for in-silico experimentation and discovery.

Our mission in the Biomedical ML Group at MSR [Microsoft Research] is to develop AI systems that contribute to biomedical knowledge via generative design and interactive discovery across length scales from molecules to patients.

At the smallest scale, we model biomolecules such as nucleic acids and proteins. These molecules function within cells, which we model specifically in the context of understanding and treating cancer. Cells form tissues. We build generative models of histopathology images in order to improve diagnostics. Finally, we study genetics and biomarkers at the whole patient level to better understand health and disease.

Today, we’ll focus on the molecular level with our protein engineering work.

Proteins are the actuators of biology. Each of our cells contains 1 to 3 billion protein molecules at any given time. Proteins catalyze metabolic reactions, replicate DNA, respond to stimuli such as light and scent, provide structure to cells and organisms, and transport molecules within and between cells.

All of this functional diversity is encoded in just 20 chemical building blocks called amino acids. Proteins are sequences of dozens to thousands of amino acid residues. In nature, these sequences often fold into a three-dimensional structure, which then performs a cellular function.

A protein’s structure and function are completely determined by its amino acid sequence. Protein design seeks to generate the amino acid sequences of new proteins that perform useful and novel functions. For example, engineered proteins in laundry detergent help remove stains while other proteins are of great interest as gene editors.

My research focuses on training neural networks on the natural diversity of proteins in order to generate new protein sequences that hopefully encode new functions. Today, I’ll focus on a model called EvoDiff. This work was done in collaboration with Sarah Alamdari, Nitya Thakkar, Rianne van den Berg, Alex Lu, Nicolo Fusi, and Ava Amini.

EvoDiff is a discrete diffusion model trained on evolutionary-scale protein sequence data. By evolutionary scale, we mean that we train on sequences taken from across many different organisms and that perform many different functions.

Diffusion models were first popularized for generating images. During training, a diffusion model learns to remove noise added to a data point. In this case, we randomly mask some amino acid residues from a protein and train the model to predict the identities of the masked residues. After training, EvoDiff can generate new protein sequences, beginning with a sequence of all masks by decoding one position at a time.
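
To make the decoding procedure concrete, here is a minimal sketch in Python; the model.predict call is a hypothetical stand-in for the trained network, not EvoDiff’s actual API.

```python
import random

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids
MASK = "#"                                   # placeholder token for a masked position

def generate_sequence(model, length):
    """Start from an all-mask sequence and fill in one residue at a time."""
    seq = [MASK] * length
    order = list(range(length))
    random.shuffle(order)                    # order-agnostic: unmask positions in random order
    for pos in order:
        probs = model.predict(seq, pos)      # hypothetical call: P(residue | current partial sequence)
        residues, weights = zip(*probs.items())
        seq[pos] = random.choices(residues, weights=weights)[0]  # sample a residue for this position
    return "".join(seq)
```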

Here, we show an example and also show the predicted structure of the generated protein after each decoding step. We see that EvoDiff generates plausible and diverse proteins across a variety of lengths. We visualize the predictions using their predicted 3D structures. The structural prediction model also outputs a confidence metric called pLDDT [predicted local distance difference test], which ranges from 0-100. EvoDiff is able to generate sequences that are likely to fold into stable structures by this metric.

These sequences are also distinct from anything seen in nature.

Often, protein engineers want proteins that perform a similar function to a natural protein, or they want to produce a protein that performs the same function but has other desirable properties, such as stability. By conditioning EvoDiff with a family of related sequences, we can generate new proteins that are very different in sequence space to the natural proteins but are predicted to fold into similar three-dimensional structures. These may be good starting points for finding new functions or for discovering versions of a protein with desirable properties. Finally, EvoDiff can also generate a complete protein sequence conditioned on a desired functional motif.

Biological functions, including binding and catalysis, are often mediated by a small structural motif held in the correct orientation by a scaffold. One way to design new functional proteins is to hand design these functional motifs, and then to generate a scaffold that will position the residues of the motif in the desired orientation. Traditionally, this is done by designing the protein structure and then finding a protein sequence that will fold to the desired structure. Here, we specified a desired functional motif from a natural protein in green, then resampled the rest of the protein around it using EvoDiff.

The new protein sequence is predicted to maintain the functional orientation of the motif with high resolution, demonstrating that we can perform motif scaffolding entirely in sequence space. By training on an evolutionary-scale dataset of 40 million proteins with many different natural functions from many different organisms, EvoDiff is able to generate plausible and diverse sequences. In addition, we have demonstrated the ability to condition on evolutionarily related sequences or on desired functional motifs within a sequence.

Looking ahead, our next goal is to train generative models that allow finer grain control of the desired function in the form of text or a chemical reaction. This sort of conditional protein design will expand the scope of applications for designed proteins in chemistry, biology, and medicine. Finally, generative models of proteins can be a building block for models of cells, tissues, and patients, as we seek to design and understand biology.

If you enjoyed this talk, please go read our preprint, or you can use the code in our GitHub to generate your own proteins. Thank you.

Project Aurora: The first large-scale foundation model of the atmosphere

Presented by Megan Stanley at Microsoft Research Forum, September 2024

Megan Stanley

“If we look at Aurora’s ability to predict pollutants such as nitrogen dioxide that are strongly related to emissions from human activity, we can see that the model has learned to make these predictions with no emissions data provided. It’s learned the implicit patterns that cause the gas concentrations, which is very impressive.”

Megan Stanley, Senior Researcher, Microsoft Research AI for Science

Transcript: Lightning Talk

Project Aurora: The first large-scale foundation model of the atmosphere

Megan Stanley, Senior Researcher, Microsoft Research AI for Science

This talk discusses Aurora, a cutting-edge foundation model that offers a new approach to weather forecasting that could transform our ability to predict and mitigate the impacts of extreme events, air pollution, and the changing climate.

Microsoft Research Forum, September 3, 2024

MEGAN STANLEY: Hi. My name is Megan Stanley, and I’m a senior researcher in Microsoft AI for Science, and I’d like to tell you all about Aurora, our foundation model of the atmosphere.

Now, weather forecasting is critical in our societies. Whether that’s for disaster management, planning supply chains and logistics, forecasting crop yields, or even just knowing whether we should take a jacket out when we leave the house in the morning, it has day-to-day significance for all of us and is very important to the functioning of our civilization. In addition, in the face of a changing climate, we need more than ever to predict how the patterns of our weather will change on an everyday basis as the earth system we all inhabit undergoes a shift.

Traditionally, the atmosphere and its interactions with the Earth’s surface and oceans, as well as the incoming energy from the sun, are modeled using very large systems of coupled differential equations. In practice, to make a forecast or simulate the atmosphere, these equations are numerically integrated on very large supercomputers. They also have to assimilate observations from the current state of the weather in order to have correct initial conditions. Putting all of this together means that making a single weather forecast is computationally extremely expensive and slow, and the simulation must be rerun for every new forecast. At the same time, the set of equations used cannot completely capture all of the atmospheric dynamics, and this ultimately limits the accuracy that can be obtained.

With Aurora, we aim to demonstrate state-of-the-art medium-range weather forecasting—that is, for time periods out to a couple of weeks—and to do so with a model that learns a good general representation of the atmosphere that can be tuned to many downstream tasks. It is our bet that, similar to the breakthroughs in natural language processing and image generation, we can make significant advances by training a large deep learning model on the vast quantity of Earth system data available to us.

Aurora represents huge progress. We demonstrate that it can be fine-tuned to state-of-the-art performance on operational weather forecasting, as well as on areas previously unexplored with deep learning, such as atmospheric pollution prediction. It’s able to do all of this roughly 5,000 times faster than current traditional weather forecasting techniques. In addition, if we compare to the current state of the art in AI weather forecasting, the GraphCast model, we’re able to outperform it on 94 percent of targets, and we do so at a higher spatial resolution, in line with the current traditional state of the art.

Aurora achieves this by training on more, and more diverse, data while training a larger model at the same time. We also demonstrate that, as a foundation model, it has the possibility of being fine-tuned on a wide range of very important downstream tasks. As a foundation model, Aurora operates using the pretrain–fine-tune paradigm. It’s initially trained on a large quantity of traditional weather forecasting and climate simulation data. This pretraining phase is designed to result in a model that should carry within it a useful representation of the general behavior of the atmosphere so that then we can fine-tune it to operate in scenarios where there is much less data or data of lower quality.

So examples of the scarce data scenario? Well, weather forecasting at the resolution of the current gold standard of traditional methods, that is, the IFS system, operating at 0.1 degrees resolution, or approximately 10 kilometers. Another good example is prediction of atmospheric pollution, including gases and particulates, where the current gold standard is an additional, very computationally expensive model applied to the IFS from the Copernicus Atmosphere Monitoring Service, or CAMS. This problem is generally very challenging for traditional forecasting systems, but it’s of critical importance.

We’re able to show that Aurora outperforms IFS on 92 percent of the operational targets, and it does this particularly well in comparison at forecasting times longer than 12 hours while being approximately 5,000 times faster. When we look at the ability of Aurora to predict weather station observations, including wind speed and temperature, it’s better in general than traditional forecasting systems. It really is able to make accurate predictions of the weather as we experience it on Earth.

On the atmospheric pollution task, Aurora is able to match or outperform CAMS in 74 percent of cases, and it does so without needing any emissions data as an input. This task has never before been approached with an AI model. If we look at Aurora’s ability to predict pollutants such as nitrogen dioxide that are strongly related to emissions from human activity, we can see that the model has learned to make these predictions with no emissions data provided. It’s learned the implicit patterns that cause the gas concentrations, which is very impressive. It has also, very impressively, managed to learn atmospheric chemistry behavior. You can see this here, where exposure of the gas to sunlight causes the changes between night and day concentrations of nitrogen dioxide.

Aurora is also capable of forecasting extreme events as well as the state-of-the-art traditional techniques. Here it is seen correctly predicting the path of storm Ciarán, which hit Northwestern Europe in early November 2023, causing record-breaking damage and destruction. In particular, Aurora was the only AI model that could correctly predict the maximum wind speed during the storm as it picked up when it made landfall.

In conclusion, Aurora is a foundation model that really is the state of the art in AI and in general weather forecasting in terms of its ability to produce correct operational forecasts. It does so 5,000 times faster than traditional weather forecasting techniques. Moreover, because it’s a foundation model, it unlocks new capabilities. It can be fine-tuned on downstream tasks where there’s scarce data or that haven’t been approached before. We believe that Aurora represents an incredibly exciting new paradigm in weather forecasting. This is much like the progress we’ve seen across the sciences, where the ability to train AI models at massive scale with vast quantities of accurate data has unlocked completely unforeseen capabilities.

If you want to learn more about how my colleagues and I at AI for Science achieve this, please refer to our publication. Thank you.

Direct Nash Optimization: Teaching language models to self-improve with general preferences

Presented by Corby Rosset at Microsoft Research Forum, September 2024

Corby Rosset

“The traditional way to fine-tune an LLM for post-training … basically tells the model to emulate good behaviors, but it does not target or correct any mistakes or bad behaviors that it makes explicitly. … Self-improving post-training explicitly identifies and tries to correct bad behaviors or mistakes that the model makes.”

Corby Rosset, Senior Researcher, Microsoft Research AI Frontiers

Transcript: Lightning Talk

Direct Nash Optimization: Teaching language models to self-improve with general preferences

Corby Rosset, Senior Researcher, Microsoft Research AI Frontiers

This talk discusses teaching language models to self-improve using a preference oracle like GPT-4, framing it as a two-player game to find an optimal policy at a Nash equilibrium, and achieving state-of-the-art win rates against GPT-4 Turbo on benchmarks such as AlpacaEval and MT-Bench.

Microsoft Research Forum, September 3, 2024

CORBY ROSSET: Hi, I’m Corby. I’m a scientist in Microsoft Research. Today, we’re going to be talking about Direct Nash Optimization, which is a technique to help language models self-improve.

We all know that there are two main ways to improve language models. One is to scale up the number of parameters or to scale up the amount of training data. Both of these approaches are costly, even for the post-training techniques. The traditional way to fine-tune an LLM for post-training is using SFT. SFT basically tells the model to emulate good behaviors, but it does not target or correct any mistakes or bad behaviors that it makes explicitly. More advanced post-training techniques such as RLHF use a fixed reward model, which can be easily hacked or go stale during training, and involve much more complex reinforcement learning, which can be unstable. Self-improving post-training explicitly identifies and tries to correct bad behaviors or mistakes that the model makes.

Before we move on, we want to give a concrete example of what we mean by self-improving behavior. Here’s a simple geometry problem where a base model that was already SFTed makes a simple arithmetic error on the left-hand side. After our self-improving technique, the model is able to correct this mistake.

Here we give a simple overview of how Direct Nash Optimization works. One of the properties of generative LLMs is that you can sample multiple outputs from them. This is advantageous because what we can do is, given an input, we can take our language model and sample, in this case, two outputs—answer A and answer B—and we can have them scored or rated by a preference function oracle, which tells us which response is better. Then we can use a contrastive training mechanism, such as DPO or IPO or others to update the parameters of the language model to hopefully improve it. In the next iteration, timestep t+1, we repeat the process over again. The key insight of this technique is how we define reward. Typically, in the RLHF framework, we want to maximize the reward of a language model policy against some given external reward model. Here, we redefine “reward” as the expected win rate against your own behavior as judged by a preference function P. What this means is that for a given response y to an input x, the reward of that response is defined as the expected win rate against y primes sampled from the policy itself. Hence, rewards are maximized by responses that are preferred over other responses.
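
Written out, this redefined reward is the expected win rate of a response against the policy’s own samples (the notation below is illustrative, not necessarily the paper’s):

$$ r_t(x, y) = \mathbb{E}_{y' \sim \pi_t(\cdot \mid x)} \left[ \mathcal{P}(y \succ y' \mid x) \right] $$

where $\pi_t$ is the policy at timestep $t$ and $\mathcal{P}$ is the preference oracle.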

When you start comparing the y primes, or the model’s own outputs to each other, this incentivizes a self-improving behavior because you’re basically competing against yourself. You can formulate this in a game theoretic manner where, in this game, you have a single player which is competing against itself, and the payoffs are given by the preference function. In this game, a Nash equilibrium is achieved by the best possible π* whose responses are preferred over any other competing policy in its class.

At a high level, Direct Nash Optimization has many advantages. Firstly, it optimizes towards a more general preference function directly rather than a point-wise reward model, which is limited in its expressiveness since it can’t model intransitive preferences. Secondly, it is an iterative algorithm, meaning it is much simpler to implement. We use a contrastive update as the loss, which does not involve any policy gradients or heavy reinforcement learning machinery. We also sample on-policy outputs from the model and compare them to each other in a self-play framework. We use a powerful preference annotator—in this case, GPT-4—to rank or judge the best response among them. This approach is also flexible since we can compare the responses to each other but also to outputs from a more powerful teacher such as GPT-4, which provides even bigger improvements. Most importantly, this algorithm is theoretically guaranteed to monotonically approach the Nash equilibrium, hence the name Direct Nash Optimization.
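
As a rough sketch of one such iteration (the helper names sample, rank, and contrastive_update are hypothetical stand-ins, not the actual implementation):

```python
def self_improvement_iteration(policy, prompts, preference_oracle, num_samples=2):
    """One iteration: sample on-policy outputs, judge them with a preference
    oracle (e.g., GPT-4 as annotator), and apply a contrastive (DPO-style) update."""
    pairs = []
    for x in prompts:
        candidates = [policy.sample(x) for _ in range(num_samples)]  # on-policy outputs
        ranked = preference_oracle.rank(x, candidates)               # which response is better?
        chosen, rejected = ranked[0], ranked[-1]                     # preferred vs. dispreferred
        pairs.append((x, chosen, rejected))
    policy.contrastive_update(pairs)  # contrastive loss; no reward model, no policy gradients
    return policy                     # becomes the policy at timestep t+1
```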

If you implement this algorithm correctly, you will find state-of-the-art results on several benchmarks, including this one, which is AlpacaEval2. This benchmark basically measures how well language models follow instructions and align with human expectations. This benchmark computes a win rate of the language model’s outputs versus a powerful reference—in this case, GPT-4—in a side-by-side comparison. The y-axis is the win rate, and the x-axis is the amount of iterations of training. We see that the dark blue line, which is DNO, the vanilla implementation, outperforms two important baselines. The red line is SFT, and the orange and yellow lines are offline contrastive algorithms, such as DPO and KTO. Hence, we see that self-improving post-training is better than offline contrastive training and SFT. Notably, DNO is also able to outperform similar training techniques from other models, which were 10 times as large, namely the gray line, which was a 70 billion parameter Llama model. We are also encouraged to see that these results do not saturate, and with more training in the purple line over more iterations, we see even better results.

We hope this work inspires other researchers to continue to investigate self-improving post-training as an effective method for aligning language models with human expectations. Thank you for watching.

Analog optical computing for sustainable AI and beyond

Presented by Francesca Parmigiani and Jiaqi Chu at Microsoft Research Forum, September 2024

Jiaqi Chu

“I have been working with a fantastic team to build a new kind of computer … it uses physics and physical systems to do … computation, which means it has the potential to be 100 times more efficient compared to state-of-the-art GPUs.”

Jiaqi Chu, Principal Researcher, Microsoft Research Cambridge

Transcript: Lightning Talk

Analog optical computing for sustainable AI and beyond

Francesca Parmigiani, Principal Researcher, Microsoft Research Cambridge
Jiaqi Chu, Principal Researcher, Microsoft Research Cambridge

This talk discusses a new kind of computer—an analog optical computer—that has the potential to accelerate AI inference and hard optimization workloads by 100x, leveraging hardware-software co-design to improve the efficiency and sustainability of real-world applications.

Microsoft Research Forum, September 3, 2024

JIAQI CHU: Hi, everyone. I’m Jiaqi, a researcher at Microsoft. Over the past three years, I have been working with a fantastic team to build a new kind of computer. It doesn’t use logic gates; it doesn’t use bits. It uses physics and physical systems to do computation, which means it has the potential to be 100 times more efficient compared to state-of-the-art GPUs. The really neat thing is that we are building it using technologies that will soon be prevalent in the consumer space.

There is a catch here. This is not a general-purpose computer. It is accelerating two different but very broad classes of applications: machine learning inference and hard optimization problems. For the machine learning inference part, we have been able to show the potential of accelerating diffusion models that can generate images and other content using this computer. Actually, there are emerging forms of machine learning that can really take advantage of the amazing amount of computing offered and achieve high-level properties, like better [generalization] to out-of-distribution data. Second, the same computer can solve hard or combinatorial optimization problems. We have identified real-world problems from many industry verticals, from healthcare, finance, chemical engineering, to robotics, that could be accelerated using this computer. Exactly the same computer supporting a wide range of applications.

But before we talk about these applications and the computer, I want to go after the “why” question. I’m sure all of you have had firsthand experience of the amazing capabilities of the latest machine learning models. We are just at the start of this inflection point. We expect that the capabilities of those models will grow tremendously, as long as we can keep pouring in exponentially increasing amounts of compute. But this is a big problem. Not only are we spending billions and billions of dollars on AI infrastructure to train and serve models; there are also serious environmental concerns about the energy and other resources that are being consumed here. I genuinely believe the sustainability of AI is one of the most important questions. Unfortunately, this couldn’t be happening at a worse time. Right when these compute demands are taking off, the future trends for digital computing do not look good, with Moore’s law slowing down. This is not just our observation; it is a broader industry concern.

Over the past five or six years, this has led to a fantastic amount of research and development. Many companies, many startups, have built nontraditional computers from the broad family of analog technologies. In this context, our journey started a few years ago. Last year, we had the first generation of our computer. It was built using bulky technology, but it was already solving a scaled-down version of a real-world finance problem from Barclays. It actually outperformed a quantum computer solving the same problem, which gave us a lot of confidence. It led to our research collaboration with Barclays and a partnership with the Microsoft Health Futures team. I’m really excited to share that we have just completed the second generation of this computer. It is much smaller in physical size, and this is a world first in that exactly the same computer is simultaneously solving hard optimization problems and accelerating machine learning inference.

Looking ahead, we estimate that at scale, this computer can achieve around 450 tera operations per second per watt, which is a 100-times improvement as compared to state-of-the-art GPUs. Let me now move on to give you an introduction to how we can compute using physical technologies. Let’s start with the basic mathematical operations: multiplication and addition. If I take a light source and shine it on a filter, like the one that you have in your camera, I can have any shade of gray on my filter; when light passes through, this is a multiplication by a weight between zero and one. This is happening simultaneously for tens of thousands of light beams that are going through this filter in a completely passive, power-free fashion—massively parallel multiplication using light-matter interaction.

Similarly, when I have multiple beams of light that fall on a pixel on my smartphone camera, the photons add up to produce a current—massively parallel addition using light-matter interaction. Once I have addition and multiplication, I can implement a vector-matrix multiplier. Benefiting from the inherent parallelism of optics, we can implement a massively parallel systolic vector-matrix multiplier. We are building these using consumer technologies. Our input vector is an array of micro-LEDs, the next big thing in the display space. The matrix in the middle is a chip that we use in digital projectors, and I have a sample here—a standard projector chip with four million pixels on it. In theory, it can simultaneously do four million multiplications when light bounces off it. Our output vector is exactly the same chip as in our smartphone cameras, the standard CMOS sensor—technologies with an existing manufacturing ecosystem that we can dovetail behind.
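
Numerically, what the optics implement is a vector-matrix multiply with light intensities as inputs and gray-level transmissions in [0, 1] as weights; a toy NumPy sketch of that abstraction (sizes are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(1024)           # input vector: micro-LED intensities (non-negative light levels)
W = rng.random((512, 1024))    # weight matrix: per-pixel gray levels, i.e., transmissions in [0, 1]

# Each filter pixel attenuates one beam (multiplication by a weight between zero and one);
# each camera pixel sums the photons that land on it (addition).
y = W @ x                      # the massively parallel vector-matrix multiplication
```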

For the most interesting applications, we also need nonlinearities, such as normalization, tanh, and sigmoid functions. We implement these using chip-scale analog electronics, with CMOS chips. Our choice to combine optics and analog electronics is unique in the industry. Hence the name “Analog Optical Computing,” or AOC for short. These are not just cartoons in slides. We have just completed the second generation of our computer, and my colleague, Francesca, will tell you about what this computer is solving.

FRANCESCA PARMIGIANI: The AOC computer has the potential to speed up two broad classes of applications, machine learning inference and hard optimization problems. The first example we run on our computer is the MNIST classification. We’ve trained the model a priori on GPUs, and we have encoded it on our projectors. As you can see, the digits are being successfully classified by our computer live and at a very high accuracy. But what’s more important here is that the computer is exactly doing what our emulator platform, our digital twin, is predicting, which really gives us the confidence that the computer is working correctly.

Exactly the same computer can also successfully solve hard optimization problems. As a second example, we have encoded onto the very same projectors an optimization problem. A 100% rate in these graphs means that I solve the problem correctly all the time. When I put now my hand in front of one of the projectors, I block the optical path, and so the computer loses track of what the problem is trying to solve. As a result, it attempted to solve it randomly, and the success rate dropped to zero. Once I remove my hand, the computer regains its understanding of the problem that’s solving, and then the success rate returns to 100%.

Looking ahead as we build a future large-scale generation of our computer, it is really critical to co-design the computer with the application to really take advantage of what the computer is good at and really to compensate for its deficiencies. Noise, for example, has always been the main historical challenge with analog computers. Fortunately, machine learning models are relatively amenable to noisy computers. For some models, like diffusion models, noise can actually be your friend rather than your enemy. Used in Bing and Copilot, just to name a few, such diffusion models work as follows: You have your training images, and then over time, you are adding noise to them until you end up with just complete noise. At inference, you run the reverse denoising process, starting from complete noise, and then you end up generating a clean-looking image, a dog in this instance. Importantly, this reverse process is iterative in nature. It is computationally expensive, and it requires a denoiser. All requirements that perfectly fit our computer. We have implemented a very small version of such a model to generate MNIST digits using our digital twin, and we aim to run it on our computer very soon. As we then increase the size of the model, we can move to more advanced images, such as Fashion-MNIST, CIFAR images, and many more.
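
A toy sketch of that iterative reverse process, with a hypothetical denoiser function standing in for the learned model (and for the analog hardware each step could be offloaded to):

```python
import numpy as np

def reverse_diffusion(denoiser, shape, steps=50, seed=0):
    """Start from pure noise and iteratively denoise toward a clean-looking sample."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)      # start: complete noise
    for t in reversed(range(steps)):    # iterative by nature, hence computationally expensive
        x = denoiser(x, t)              # one denoising step; this is the part that fits the AOC
    return x                            # end: a generated image (e.g., an MNIST digit)
```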

Diffusion, though, is only one example of broader classes of analog-amenable machine learning models that have this feedback nature. Others include deep equilibrium models and neural ODEs, and actually, even some of the latest models, like flow matching and state space models, seem to be amenable to our computer, which is really fantastic news for us.

The same notion of co-design is also key for optimization. Let me give you a real-world example from the healthcare sector. Most likely, you or your loved ones have been inside an MRI scanner, not really a great place to be in. Imagine if you could reduce that amount of time from 20–40 minutes to less than five minutes, and the implications for the patient’s experience and for treatment modalities. Actually, the math here is 15 years old, something called compressed sensing. The idea is that when your patient is inside the scanner, you are under-sampling this image—or more precisely, the scan in the Fourier space—and then you are solving a hard optimization problem to recover, ideally, the image with full fidelity. Because the problem was computationally hard, it never took off, but we have been able to map this optimization problem to the formulation our computer solves. You can see the corresponding results here and how we can iteratively converge to the ground-truth scan using the AOC algorithm. Based on our earlier investigation, we think we could be able to accelerate MRI scans by a factor of 4–8x while achieving reconstruction with high fidelity, potentially reducing the scanning time down to five minutes only.
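
For reference, the classic compressed-sensing recovery problem is usually posed in a form like the following (a generic textbook formulation, not necessarily the exact mapping used on the AOC):

$$ \min_{x} \ \lVert \Psi x \rVert_1 \quad \text{subject to} \quad \lVert A x - y \rVert_2 \le \epsilon $$

where $y$ is the under-sampled Fourier-space (k-space) measurement, $A$ is the under-sampled Fourier operator, $\Psi$ is a sparsifying transform, and $\epsilon$ bounds the measurement noise.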

Certainly, this is an extremely high-risk but highly ambitious project, and it’s super exciting that Microsoft Research supports and, in fact, encourages such work. Of course, none of this would be possible without a fantastic interdisciplinary team behind us to really rethink across the whole computer stack. All these people here, world leaders in their own disciplines, instead of carrying out their own research in silos, have chosen to work together and operate at the boundary of their disciplines, which is where I believe key breakthroughs can happen. But this is not enough. We also need to build a community. Towards this, we are calling for people to submit to and participate in our workshop at NeurIPS 2024, where we aim to bring together ML and hardware experts. We are also looking to expand our collaborations to gain more experience in solving industry-specific optimization problems.

Towards this, we have launched an online service to allow partners to map and run their problems on our computer. To wrap up, we have been building a new kind of analog optical computer, which has the potential to offer a step change in computing performance using consumer technology. The most important thing I want to leave you with is that co-designing our applications with the underlying computer is the only way for this technology to have a chance in the future of computing. Thank you for listening.

Panel Discussion: Beyond Language: The future of multimodal models in healthcare, gaming, and AI

Hosted by John Langford, with Katja Hofmann, Jianwei Yang, and Hoifung Poon at Microsoft Research Forum, September 2024

Katja Hofmann

“I believe that starting to understand what that new human-AI collaboration paradigm could look like, that is something that everyone with computer access would be able to experience within the next five years.”

Katja Hofmann, Senior Principal Researcher, Microsoft Research

Transcript: Panel Discussion

Beyond Language: The future of multimodal models in healthcare, gaming, and AI

Katja Hofmann, Senior Principal Researcher, Microsoft Research
Jianwei Yang, Principal Researcher, Microsoft Research Redmond
Hoifung Poon, General Manager, Microsoft Research Health Futures
John Langford (host), Partner Research Manager, Microsoft Research AI Frontiers

This discussion delves into the transformative potential and core challenges of multimodal models across various domains, including precision health, game intelligence, and foundation models. Microsoft researchers share their thoughts on future directions, bridging gaps, and fostering synergies within the field.

Microsoft Research Forum, September 3, 2024

JOHN LANGFORD: Hello, everyone. I’m John Langford. I’m joined by Katja Hofmann, Hoifung Poon, and Jianwei Yang, each of whom is working on multimodal models of actually quite different varieties. The topic that we’re going to be thinking about is multimodal models and where the future is. I guess I’d like to start with what do you see as, kind of, the key benefits and uses of a multimodal model. Maybe we’ll start with Hoifung. Give us a background of where you are at with multimodal models, what you’re working with them on, and where you see them really shining.

HOIFUNG POON: Thanks, John. Very excited to be here. As, John, you mentioned, one of the really, sort of, like, really exciting frontier is to advance multimodal generative AI, and for us, particular exciting area is to apply this into precision health. And so … cancer is really, sort of, the poster child for precision health, right. So, for example, one of the really cutting-edge treatments for cancer these days is immunotherapy; that works by mobilizing the immune system to fight the cancer. And then one of the blockbuster drugs is Keytruda, that really can work miracles for some of the late-stage cancer, and the annual revenue is actually above $20 billion. Unfortunately, only 20 to 30 percent of the patients actually respond. So that’s really, sort of, like, a marque example of what are the growth opportunity in precision health. If we look back at the past couple of decades, one of the really exciting things happening in biomedicine is the rapid digitization of patient data. That goes from medical imaging to genomics to medical records, clinical notes, and so forth. So every day, if you look around, there are literally billions and billions of data points collected in, sort of, like, clinical care, routine clinical care, that document this very high dimension patient journey from diagnosis to treatment to outcome. For example, a cancer patient may have hundreds of notes, and but also crucially, it will have information from, like, radiology imaging, like from CT to MRIs, and so forth, and by the time when the cancer patient get biopsies or resection, then you will also get digital pathology, you’ll get genomics, and so forth. So one of the really exciting opportunities is that all this, kind of, modality are trying to tell you something about the patient, right, but each of them are very limited in its own right. So I like to liken it to, sort of, like, blind folks touching the elephant, right. So each modality gives you one piece of the elephant, and only by, sort of, kind of, like, combining all those modality, and we can recapitulate, sort of, like, the holistic representation of the patient. So that’s, sort of, like, what we see the most exciting opportunity is: can we learn from real-world data at the population scale to be able to train very powerful, sort of, like, frontier biomedical models that can create this kind of multimodal patient embedding that synthesizes a multimodal longitudinal journey of a patient to essentially serve as a digital twin? Then you can start to reason about it, to find patient like me, right, at population scale, to figure out what works, what doesn’t work, and so forth. We start to actually see some of the promising, kind of, proof points by working with large health systems and pharmaceutical companies, clinical researchers, and so forth.

LANGFORD: So fusing different modalities to get a holistic picture of a patient in order to analyze what kind of treatments may work and so forth …

POON: Precisely. The very minimal is that you can start leveraging that very high-fidelity patient embedding. Like, for example, today for cancer patients, people will like [say], let’s go find a second opinion, right. With this at the population scale, you can get 5 million second opinions, right, to find all the patients like this person, and now you can interrogate what are the treatments that people have tried, right, and what actually works? What doesn’t work? Now you can start to, you know, make better decisions, so there’s immediate benefit, but more importantly is that you can start to also … like, for example, in the Keytruda case, you can start to interrogate, who are the exceptional responders versus those 70 percent, 80 percent non-responders? How are they different, right? That would give you a lot of clues about, sort of, like, why the existing drug and target doesn’t work, and that could potentially drastically accelerate, kind of, discovery.

LANGFORD: All right, thank you, Hoifung. Katja, do you want to tell us a little bit about your multimodal models?

KATJA HOFMANN: Sure. Interestingly, the kinds of applications in the space we’ve looked at in my theme is very, very different from those applications in precision health that Hoifung just mentioned. We have looked at one of these fundamentally human activities of creative ideation, and we’ve been exploring the potential of generative models for this in the context of game creation. So coming up with new ideas for video games is something that, of course, people are doing on a very regular basis. There are 3 billion players on the planet that rely on getting very diverse, engaging content in order to create these really immersive or connecting experiences. And what we’ve seen looking at generative models for this is that, one, there is a huge potential for that, but at the same time, we still need to push on capabilities of these generative models, for example, in order to support divergent thinking or to allow people to iterate and really control the kinds of things that these models are able to produce. In my team, we have focused on initial models that are multimodal in the sense of modeling both the visuals of what a player might see on the screen as well as the control actions that a player might issue in response to what’s on the screen. So those are the two modalities that we have so far. And with the models that we have trained in the space, we see that they have this amazing capability of picking up, for example, an understanding of the underlying game mechanics, how different characters interact with each other, and they also provide some amount of ability to be modified by their users. So, for example, creatives could inject new characters or new components and then just work with this material to envision how different gameplay situations might work out. So I see a lot of potential for models like this to support creative ideation in many different application domains. Over time, we will understand how we can add additional modalities to this kind of material for creators. And I think we’re only at the beginning of exploring what’s possible and what new kinds of things people will come up [with] when they have this new material at their disposal.

LANGFORD: So just relating to Hoifung’s, it seems like Hoifung probably deals with a lot more data access– and incomplete data–type issues than you may be dealing with, but then there’s also, kind of, a creative dimension of things never tried before, which you would like to address that maybe is not so important in Hoifung’s case.

HOFMANN: That’s a really good reflection. I think this already points to some of the key both challenges and opportunities in the space. One opportunity is just the fact that Hoifung and my work is, in many ways, so similar; it’s really quite striking. We’ve seen a, kind of, confluence of models in the sense of we have a really, really good understanding of some of the things that work really well, especially when we have the right data and when we know how to scale them up, but there are also some fundamental questions on, how do we deal with partial incomplete data? How do we curate and create datasets? How do we understand the impact of the variety, quality, and scale of data that we have at our disposal on the ultimate quality of the models that we can create? In our case, like you say, we build on, kind of, the rich data that we can obtain from game environments. And that also means that we can inform some of this research. We can build insights on how to best use this kind of data that might actually benefit Hoifung and his team in building better models, ultimately for precision medicine, which, I find incredibly exciting. And then there are dimensions where we really look for very different things. Precision, of course, accurate results is very, very important for any application in the health space. Whereas in our case, for example, capturing the full diversity of what might be possible in a given game situation or pushing on boundaries, creatively recombining elements, is something that we might be looking for that may be much less desirable in other applications.

LANGFORD: Thank you. Jianwei, can you tell us about your multimodal models?

JIANWEI YANG: Yeah, yeah. Yeah, so hi, everyone. I’m very glad to be here to discuss multimodal models. So my background is more from computer vision. So I started my computer vision journey roughly maybe 10 years ago. Now actually, we are mainly talking about the multimodal model, so typically, actually, the multimodal model covers, for example, vision and language and how we can combine them together. In my opinion, I think at a high level, this kind of multimodal model actually can really, OK, help in a lot of, kind of, applications or a lot of, kind of, scenarios in that it can really help to capture the world around us. So it has the visual input, and it can capture what kind of objects and what kind of relationships and what kind of actions are in the image or videos, etc. On the other hand, actually, this kind of multimodal model, by connecting vision and language, can really help the model have the communication capability with humans, so that humans can really have a conversation and can chat with the model and then prompt the model to really finish some task that the human or the user requires it to do. So overall, I feel there are a bunch of applications in these kinds of multimodal scenarios. So from my point of view, so at a high level, it can be applied to not only the digital world, as Hoifung and Katja just mentioned, in their health and gaming scenarios, but also it can be applied to the physical world, right. So if we really have a multimodal system or AI agent that can really, OK, understand the whole environment, the physical world, and then have a very good communication capability, actually it can be deployed to, for example, autonomous driving systems or even a real robot, right, so that we can really have a very good, kind of, copilot or something like that to help us to do a lot of daily tasks. This is quite an exciting domain, but also, actually, we are still just at the beginning of this journey.

LANGFORD: So relative to what Katja and Hoifung talked about, are you thinking about more general-purpose multimodal models, or are you thinking about individual special case ones for each of these individual applications?

YANG: Yeah, I think that’s a very good question. I think all of these applications share some basic principles. In terms of model building, we really need to care about the data and care about the modeling. I will roughly talk about the modeling part. To have a capable multimodal model, we need to encode information from different modalities, for example, from vision, from language, and even from audio and speech, etc. So we need to develop a very capable encoder for each of these domains and figure out how to tokenize each kind of raw data. Pixels are raw data, speech is raw data, and we need to develop good models to tokenize each of the modalities so that we can project each modality into the same space and model the interaction across data from different modalities, so that the model can really be used to accomplish complicated tasks in each domain. In my opinion, we share a lot of common interests across different applications. For example, in our team, we have been doing a lot of research toward a general-purpose multimodal system, and in the meantime, we have a great collaboration with Hoifung’s team to deliver domain-specific models, like LLaVA-Med and BiomedJourney, for conversational medical AI and for medical image generation, editing, and prediction. So all of these share some basic components in terms of modeling.
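To make the idea concrete, here is a minimal sketch, in PyTorch, of what Jianwei describes: each modality gets its own encoder or tokenizer, everything is projected into one shared token space, and a single transformer models the cross-modal interaction. The class, layer sizes, and modality set are illustrative assumptions, not any specific model discussed on the panel.

```python
# Minimal sketch: encode each modality separately, project into one shared
# token space, and let self-attention model the cross-modal interaction.
# All names and dimensions are illustrative only.
import torch
import torch.nn as nn

class SharedSpaceMultimodalEncoder(nn.Module):
    def __init__(self, d_shared=512, d_vision=768, d_audio=256, vocab_size=32000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_shared)   # text is already discrete tokens
        self.vision_proj = nn.Linear(d_vision, d_shared)        # map vision features to the shared space
        self.audio_proj = nn.Linear(d_audio, d_shared)          # map audio features to the shared space
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_shared, nhead=8, batch_first=True),
            num_layers=4,
        )

    def forward(self, text_ids, vision_feats, audio_feats):
        # Each modality becomes a sequence of tokens in the same d_shared space.
        text_tokens = self.text_embed(text_ids)                 # (B, Lt, d_shared)
        vision_tokens = self.vision_proj(vision_feats)          # (B, Lv, d_shared)
        audio_tokens = self.audio_proj(audio_feats)             # (B, La, d_shared)
        tokens = torch.cat([text_tokens, vision_tokens, audio_tokens], dim=1)
        return self.fusion(tokens)                              # cross-modal interaction via attention
```

In practice the per-modality feature extractors (a vision transformer, an audio encoder, and so on) would be pretrained separately and often frozen; the sketch only shows where the shared space and the fusion step sit.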

LANGFORD: All right, thank you. Maybe a question to, kind of, sharpen the discussion here is, what is, sort of, the top one multimodal challenge that you guys are running into? Hoifung, maybe you can start.

POON: Sure. Happy to. First, I want to echo Katja and Jianwei. I think one of the really exciting things is how much commonality there is in a lot of this work. That also speaks to a more general trend in AI that people sometimes call the great consolidation. For example, I come from an NLP background; Jianwei comes from a computer vision background. We could be friends in the past, but we probably rarely shared much in the way of actual modeling techniques. These days, a lot of the underpinnings across these different modalities are actually quite common, for example, powered by the transformer, which is at least one of the prominent paradigms. That really opens up a lot of cross-pollination, as Jianwei and Katja have alluded to, because you can use a transformer to model imaging; you can model video, text, protein sequences, and all that. That is a super exciting thing, and what you see across the board is that a breakthrough in one modality can often very quickly translate to others. Now, back to your earlier question, and one thing you alluded to earlier, John: in biomedicine, there are some specific challenges. No. 1 is that a lot of this is obviously very high stakes, so we have to take the utmost care about things like privacy and compliance. For example, in all of our collaborations, the data and all the compute stay with our partner; with a health system, for example, it is strictly within their tenant, and we work within their tenant to collaborate with them and make sure that all of this is super buttoned up. Those are some immediate, pragmatic challenges, and then, if you look across the board, there are some infrastructure challenges. For example, before the days when a lot of this data was actually on the cloud, it was very difficult to do a lot of the on-prem computing. But nowadays, because a lot of data is starting to move to the cloud, it is a lot easier to apply this kind of AI progress. Then, in terms of specific modeling challenges, we benefit a ton from the general-domain progress, first and foremost. A lot of our basic modeling builds on the great work from Jianwei and his team. However, when you look at biomedicine, there are also very specific challenges. I will start with individual modalities. For example, the current frontier models, let’s say GPT-4 and so forth, are actually really, really good at reasoning over and understanding biomedical text. But once you go beyond that to the non-text modalities, once you look at CT, MRI, digital pathology, and so forth, those frontier models expose their competency gap. The challenge is that there is a ton of biomedical text on the public web that GPT-4 was able to consume. PubMed alone has 32 million biomedical papers. But multimodal longitudinal patient data does not really exist in any quantity on the public web. So that creates lots and lots of competency gaps. There are lots and lots of unique challenges.
For example, take digital pathology. A pathology whole-slide image can contain billions of pixels and can be hundreds or even thousands of times larger than a typical web image. That means standard ways to use a transformer completely blow up, because attention is quadratic, which would mean billions of times more compute for a single image. And CT and MRI are not 2D images; they are also very different from the 3D video you would normally find in the general domain. All of those present very exciting challenges for individual modalities, and that creates a lot of exciting research opportunity: what are some of the modality-specific inductive biases we can harness to do modality-specific self-supervised pretraining, and so forth, so that we can effectively do dimension reduction within each individual modality? But once we do that, there is still a big, big challenge, as Jianwei alluded to earlier: for example, a tumor lesion in a CT image may be mapped to a very different place in the embedding space than the same tumor lesion in digital pathology, because the two encoders were pretrained independently with self-supervision. So then comes another, and I would say even much bigger, challenge: how do we handle this kind of multimodal complexity? As Jianwei alluded to, how can we ensure that the tokenizations for individual modalities are actually aligned, so that you can put them together and start doing effective multimodal reasoning? Coming from an NLP background, you can think about this a little bit like translation. The world has hundreds if not thousands of languages, and one of the key challenges is how to enable communication across those languages. One very effective approach the machine translation community has come up with is the idea of introducing a resource-rich language as an interlingua. For example, if I have a language from Africa and a smaller language from India, there is pretty much zero parallel data between them, so I don’t know how to translate between them. But if I can learn to translate that African language into English and then translate from English into that Indian language, then we can successfully bridge the two languages. In multimodal modeling, we see an emerging opportunity to do the same thing, basically using the text modality as the interlingua. The reason is that for any modality under the sun, the study of that modality typically involves natural language: when you get an image like a radiology image, you often have an accompanying radiology report; when you have a digital pathology slide, you usually have a pathology report. So we can use that language as natural language supervision to ground the embeddings, to nudge the embedding space of each individual modality toward the common semantic space represented by the text modality. In this way, we can also capitalize on the dramatic investment in frontier models that are very, very good at the text modality. If we can make sure that all the modalities roughly align their common concepts into the text semantic space, then that makes the reasoning much, much easier.
And a lot of this … for example, Jianwei mentioned the LLaVA-Med work, which was really inspired by the general-domain LLaVA work in vision-language modeling. We were able to generalize that into biomedicine and start using those modality-text pairs to ground the individual modalities.
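The text-as-interlingua idea Hoifung outlines is commonly implemented as CLIP-style contrastive alignment: each non-text modality is pulled toward the embedding of its accompanying report, so independently pretrained encoders land in roughly the same semantic space. The sketch below shows the generic symmetric contrastive loss under that assumption; it is not the LLaVA-Med or BiomedCLIP training code.

```python
import torch
import torch.nn.functional as F

def align_to_text(modality_emb, text_emb, temperature=0.07):
    """CLIP-style contrastive alignment of one modality to its paired report text.

    modality_emb: (B, d) embeddings from a modality-specific encoder (e.g., pathology, CT)
    text_emb:     (B, d) embeddings of the accompanying reports from a text encoder
    """
    m = F.normalize(modality_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = m @ t.T / temperature                    # (B, B) similarities of all pairs in the batch
    targets = torch.arange(len(m), device=m.device)   # matched (modality, report) pairs sit on the diagonal
    # Symmetric InfoNCE: matched pairs should score higher than all mismatched pairs.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

Because every modality aligns to text independently, adding a new modality costs one more alignment rather than a pairwise alignment against every existing modality, which is the linear-versus-combinatorial point Hoifung returns to later in the discussion.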

LANGFORD: So a couple of comments. One of them is, what you described at the beginning, I’ve heard described as the “great convergence.” So it used to be that there were vision conferences and NLP conferences and machine learning conferences—and there still are. But at the same time, suddenly, what’s going on in these other conferences is very relevant to everybody. And so it’s interesting. … Suddenly, lots more things are relevant than they were previously. And then amongst the different challenges you mentioned, I’m going to pick, I think, on the competency gap. Because it seems like it’s an interesting one that applies potentially across many different folks. I think that the high-stakes nature of medical situations is also a very important one but specific to the medical domain. So the competency gap I’ve seen described elsewhere, and there have been a number of studies and papers. That’s an interesting challenge, definitely. Katja?

HOFMANN: Can you remind me what you mean by competency gap? I missed that maybe …

LANGFORD: The competency of these models in the vision modality is not as good as it is typically in a text modality. In a text modality, they’re pretty good in a lot of situations. But in the vision modality, there are simple things like, you know, what’s the relationship between this object and that object in the picture where these models can just not really succeed.

POON: Yeah, specifically, Katja, as John alluded to, for things beyond text, let’s say image and speech, even in the general domain there are already some challenges and certainly room to grow. But once you go to a vertical space like biomedicine, those competency gaps become much, much more pronounced. If you ask some state-of-the-art image generators to, say, draw me a lung CT scan, they will actually draw you a glowing lung. [LAUGHS] They have no idea what a CT scan is. And you can see this across the board. There may be some particular classification task where some of the data is public, such as academic datasets and so forth. In that case, the frontier model may have seen and been exposed to them, so it may have been able to internalize some of that. So if you ask some of the really top frontier models, they at least have some idea: oh, this should be an x-ray; this should be a CT scan; and so on. And sometimes they may even have some idea about more detailed information. But once you go really, really deep, like, hey, what is tumor lymphocyte infiltration? They have no idea. You can also argue that it’s very difficult to wait for the frontier models to parse all the quality tokens on the public web, because a lot of this kind of multimodal patient data simply doesn’t exist on the public web, for good reason. So that puts an important responsibility on us to figure out a scalable AI methodology to quickly and efficiently bridge those competency gaps for individual modalities and also to enable combining and synthesizing them, as Jianwei alluded to earlier.

LANGFORD: So, Katja, what’s your top one challenge?

HOFMANN: There are many, many challenges. I found myself agreeing with Hoifung on many of them, and I was going through a couple of reactions. So let me start with thoughts on the competency gap and also that interlingua, whether natural language is going to be that, I don’t know, shared modality connecting across all of them. I was just, as you were speaking, thinking through that it’s weird in the sense that some of our key insights in this space come from large language models. So really a model that is started … because that data was most readily available maybe, we have a lot of our insights from specifically language, which, of course, in our own human experience, doesn’t come first. We experience the world through vision, touch, and all our other senses before we start to make sense of any of the language that is spoken around us. So it’s really, really interesting to think through the implications of that, and potentially, as we start to understand more about the different modalities that we can model and the different ways in which we combine them, some of those initial insights may no longer hold. In our examples of gameplay data restored from visuals and controllers, we get this really highly consistent model that is able to generate 3D space, how characters interact with the space, how game mechanics work, cause and effect, and things like that, which is something that is quite different from what has been observed in many other models. So I think there is quite a lot of hope, kind of, that we will be able to arrive at models that have a much more holistic understanding of how the world works once they no longer rely on language as their primary sense, for example. And because of that, I’m also not sure whether indeed natural language will be that shared underlying thing that unites modalities. There might be other, better choices for that because language … it is abstract. It can capture so many things, but it might not be the most efficient way of communicating a lot of the things that we can grasp in other modalities. I’m really, really curious where that is heading. But coming back to your question, John, on what I see as the biggest challenge. It feels like we’re at such an early stage in this field. I could map out tens, hundreds of questions that are really, really interesting, starting from representations and data to model architectures to the way in which different model components might be trained separately and might interact in, kind of, societies of models or agents, if you will. But what I’d like to bring this back to is that for me, personally, the way we ultimately use those models, the way we understand how they can fit within our, kind of, human practices, to really make sure that when we build these models and they have the potential to change the way we do our work or play games or interact with each other, that we should be focusing on making sure that they are as useful, as empowering to their users as possible. So I think we could do a lot more in understanding not just purely what we can build but also what is the need. What are the kinds of capabilities we need to imbue our models with so that we could really drive progress in the right direction and make sure they empower as many people as possible?

LANGFORD: So I’m getting two things from what you’re saying. One of them is there’s a little bit of disagreement about what the ideal, long-term interlingua is between the different modalities. One of them could be language; one of them could be more of a vector space representation–type thing.

HOFMANN: Yup.

LANGFORD: And then the challenge that you’re pointing out is understanding how to apply and use these models in ways which are truly useful to people.

HOFMANN: And not just apply and use the models we already have, but how do we inform that next wave of model capabilities, right? We’ll train models on whatever data we have. But then there’s also the more maybe demand-driven side of in order to unlock this next generation of applications, what do we need and how do we systematically go about then putting the right pieces together? For example, collecting the data that is needed to drive that kind of innovation.

LANGFORD: Jianwei, your turn. What is the primary challenge you see with multimodal models?

YANG: Yeah, I think there are some great debates in the whole domain regarding whether we should take vision as the core or language as the core. I want to share my two cents on the discussion Katja and Hoifung just had. I still remember at the very beginning of the deep learning era, when people talked about deep learning, they usually mentioned ResNet or AlexNet or ImageNet, these kinds of vision-centric benchmarks or models. But right now, when people talk about deep learning and AI, usually we mention large language models, etc. You can see there has been a transition from vision to language. I think this also implies a methodology transition. At the very beginning, across these different modalities, researchers were trying to collect labeled data and train supervised models to classify images, classify text, and so on. But later on, people, especially in the language domain, came up with the idea of self-supervised learning, so that the model can really learn from a huge amount of unlabeled data, and this huge amount of unlabeled data already exists on the internet; people create a lot of text data. Another benefit of language data, by nature, is that it can be naturally and natively tokenized. It’s not like images or speech, where we cannot easily handle the hundreds of millions of pixels in a single image or video. Language is pretty compact, as Katja just mentioned, and also representative, so people can easily tokenize it, convert the language into discrete IDs, and then encode these discrete IDs in the feature space. So I can feel that vision is now lagging behind language in general. Even though I’m from the vision side, I can see a gap in terms of how we can build a self-supervised learning system to learn visual representations that can match the power of language representations, so that we can really merge or bridge the gap you just mentioned, John. Different modalities definitely have a gap, but the vision side lags behind, and we need to handle that. Talking about this, I think one of the big challenges in building a very capable multimodal model is how we can bridge the gap by bringing the vision representation and the vision tokenizer up to a similar level as the language representation and language tokenizer, so that we can have very good, intimate interaction across the two different modalities. One last point I want to make is that whether the whole system should be language native or vision native or native in any other modality also depends on the application. For some applications, for example in Hoifung’s domain, Health Futures, people need to handle a lot of documents; on the other hand, in your gaming domain, Katja, the model needs to handle a lot of pixels and a lot of reasoning and temporal planning, etc. So in my opinion, this really depends on what kind of scenario we are handling.
Some tasks need a very good representation of language because they are language heavy. But other tasks, like autonomous driving or robotics or planning or visual planning, rely more on whether the model really understands the visual signals, from pixels to mid-level segmentation to the high-level interactions across different objects in the environment or in the physical world. So I can feel there are still some differences, and it’s still not merged, and this is, I think, why I feel it is very exciting to keep pushing forward in this domain, yeah.

LANGFORD: So I think what I’m getting from you is the capability gap is, kind of, a key concern, as well, is that right?

YANG: Yup, yup.

LANGFORD: OK, so capability gap, capability gap, and understanding how to really use these models in ways which are truly beneficial to people. All right, so maybe given this and given the other discussions, what are your predictions for closing these gaps and figuring out how to truly use these models in effective ways? Do you have thoughts for where we’re going to be a year from now, three years? Even three years is, kind of, far at this point. Five years? I think that trying to predict the future is one of these things where you always feel a little squirmy because it’s hard. But at the same time, I think it could be really valuable to get your sense of where things are going to go and how fast. Hoifung?

POON: Yeah, so first, I want to echo what Katja and Jianwei pointed out. I actually don’t see it as a disagreement, because I think all of us are talking about the same thing, which is that we want to embed all those modalities into a common semantic space. All of them are vector representations. Even when I mention text as an interlingua, it doesn’t mean we are literally trying to map those images and protein sequences to text tokens. That could be one way to do it, but the much more general way is to use the text signals, and in fact the text is also mapped into the embedding space before all this modeling happens, as in LLaVA-Med. So the end goal is the same. The challenge is that a very common paradigm is contrastive learning, and a great example is CLIP, which handles two modalities by pushing them into a common semantic space. That works fine when you just have two modalities. Once you have three, [LAUGHS] you start to get a three-body problem, and once you have four and five and six, that’s where the combinatorial explosion happens. That’s what motivates us to ask, can we reduce the combinatorial explosion to maybe a linear set of alignments to some potentially well-developed semantic space? Text, for one thing, as Jianwei alluded to, has an advantage in the sense that, if you look at the public web, text is not just a bunch of words; it actually captures a lot of human knowledge. For example, a gene like EGFR may not mean anything to folks who don’t study genomics, but for folks who do, EGFR is a very important gene, and everybody immediately conjures up, oh, this is connected to lung cancer, because a lot of lung cancers are caused by mutations in the EGFR gene. You can also easily conjure up the sequence of EGFR and knowledge about how the protein encoded by EGFR binds to other proteins and what kinds of pathways or functions they control. All of this knowledge is captured not just by the single modality of the sequence but by mapping into the text semantic space, and once you do that, you immediately have access to that vast amount of knowledge. I also want to quickly acknowledge what you mentioned, Jianwei: it is fascinating to think about the self-supervision landscape in computer vision versus NLP. In the early days, as an NLP person, I was very jealous of vision because you have so many invariances, translation invariance, rotation invariance, that you can use to create synthetic training data. Until the day in NLP when people figured out masked language modeling, when you start to play hide and seek with the words, and that became a very powerful form of self-supervision as well. If we directly apply that heuristic to computer vision and try to mask a patch or a pixel, that’s not as effective as it is in language so far. So exactly to your point.
It’s fascinating to think about the different kinds of modalities and the different kinds of inductive biases. But I would say fundamentally it all boils down to, and John, you know this all too well, what kind of training data, in a general sense, is available. Single-modality data, unannotated images and so forth, is the most abundant, so there is all this exciting work on how to harness it. I would also argue that text is not a perfect interlingua, but you can think of it as a second source of free lunch, because all these modalities often have some accompanying text associated with them. But I would be the first to point out that that’s far from complete. For example, five years ago we didn’t have the vocabulary word COVID, even though that virus molecule could already exist in some imaginary space. Even today, as you alluded to, Katja, if you look at a radiology report, it doesn’t capture every single thing in the images. In fact, even the best radiologists may not agree with a report they themselves wrote six months ago about the same interpretation. So I think some of the fascinating potential is, can we also ground it in the long-term outcome? For example, did the patient’s cancer recur six months later or not? That’s much less ambiguous compared to the signal that is immediately available for the modality. So those are some of my thoughts. But to directly answer your question, John, like both Katja and Jianwei mentioned, the really exciting prospect is that ultimately this is not just an intellectual exercise; it can actually bring huge benefit, potentially real-world impact. When I think about what kind of real-world impact, I see a continuum between what I will loosely call productivity gain and creativity gain. To answer your question, I already see a lot of high-value, low-risk opportunities in the places where a human expert can perfectly do the task; it just may be very repetitive, very boring. [LAUGHS] In those cases, I think a lot of multimodal generative AI has already gotten to the point where it can assist humans in doing some of those tasks at scale. And the beauty of that is that, exactly because a human expert can perfectly do the task, you can very naturally use a human-in-the-loop setting. Then you can ensure accuracy, ensure that AI errors can be easily corrected, and feed those errors back to improve the AI system. Some examples: think again about a cancer patient. Unfortunately, oftentimes for late-stage cancer patients, the majority of standard care doesn’t work. That leaves, for example, clinical trials as the last hope. But every year, in the US alone, there are 2 million new cancer patients, and there are thousands of trials at any given time; if you go to CDC.gov, half a million trials. So how do we actually match a trial to a patient? Today, health systems basically hire so-called trial coordinators.
Basically, they manually try to look at the patient record and look at a trial, and that’s completely not scalable. But you can imagine that we can learn a patient embedding that captures all the important information about the patient, and you can also embed the trial, because trials are specified by eligibility criteria, that is, what kinds of properties I want to see in a patient in order to include them in the trial. Once you have that in the embedding space, you can do matching 24/7, and you can do what people call just-in-time clinical trial matching. A few years ago, this was still a novel concept; nowadays, it has already become applicable in many places. I just want to highlight one example. We are very fortunate to collaborate with Providence, which is the fourth-largest health system in the US. They have started using our AI research system daily now to help their tumor board do this trial matching at scale. Examples like that are becoming more and more available. For example, with digital pathology, you can use generative AI to learn a very good model to, say, classify the subtypes of a cancer and so forth, and then have human experts do … checks and so forth. So that’s already happening, and lots of this is already happening even now. But looking forward, I think the most exciting prospect is what I would loosely call creativity gain. That’s the regime where even the best human expert has no idea how to do it. For example, on a digital pathology slide, you can discern, here’s a tumor cell; here is the lymphocyte, or the immune cell; here are the normal cells. Looking at the configuration actually gives you lots of clues about whether the immune system has already been alerted by the cancer, and thereby it can determine to a large degree whether immunotherapy will work. But right now, even the best pathologists only have a very weak heuristic: OK, let me count how many lymphocytes are within the tumor region. Using that as a very rough proxy does somewhat better than not looking at it at all. But arguably, there could be tons and tons of subtle patterns that even the best human experts have no idea about today, but that generative AI could potentially [inaudible]. I think that would be the most exciting prospect.
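As a rough illustration of the just-in-time matching Hoifung describes: once patients and trial eligibility criteria are both embedded, matching reduces to a nearest-neighbor search. The function below is a toy sketch with hypothetical inputs; a real system would also filter on hard eligibility criteria and keep a human coordinator in the loop, as he notes.

```python
import torch
import torch.nn.functional as F

def rank_trials(patient_emb, trial_embs, top_k=5):
    """Toy sketch of embedding-based clinical trial matching.

    patient_emb: (d,) embedding summarizing a patient's longitudinal record
    trial_embs:  (N, d) embeddings of trial eligibility criteria
    Returns indices and scores of the top-k candidate trials; a human
    coordinator would still review every suggested match.
    """
    scores = F.cosine_similarity(patient_emb.unsqueeze(0), trial_embs, dim=-1)  # (N,)
    top = torch.topk(scores, k=min(top_k, len(trial_embs)))
    return top.indices, top.values
```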

LANGFORD: OK, so human-in-the-loop generative AI visual pathology, for cancers and immunotherapy. How long is that? One, three, five years?

POON: I would say there are some immediate applications, let’s say, classifying one of the subtypes of a cancer. Right now, the best generative AI models are already doing really well at that. So, for example, we recently have a Nature paper …

LANGFORD: How long until it’s deployed?

POON: So right now, for example, at Providence, they are already looking into actually using that as part of the workflow to say, hey, can we use that to help the subtype …

LANGFORD: So within a year.

POON: Yeah, and also, there are some additional applications, for example, predicting mutations, and that could be solved by uncovering some of the underlying genetic activity and so forth. But I would say the holy grail here would be to actually go from that 30 percent response rate for immunotherapy to something much, much higher. There has already been some study of that, but usually on a smaller scale of patients, and it hasn’t really incorporated a lot of the state-of-the-art generative AI techniques. Right now, I think the really exciting prospect is, for example, one science paper two years ago had 200 patients, and if you don’t have a very good representation of the digital pathology slide, or of the general patient embedding, then 200 data points don’t give you that much signal. On the other hand, if you can learn from billions and billions of pathology images from millions of slides, then you can learn a very good representation. Building on top of that kind of representation, you could start to learn much more efficiently from the long-term outcome. And for the first stage, the productivity gain, I also want to highlight, for example, the exciting partnership between some of our sibling teams and Paige AI. They have also demonstrated that you can do clinical-grade classification for certain tasks today. But the holy grail would be to go toward modeling tumor microenvironments and actually predicting immunotherapy outcomes. I can’t really predict when we will get above 90 percent on response rate prediction, but I think there is huge room for growth. Conceptually, we already see a tiny bit of promising results, even using very low-dimensional features and smaller numbers of data points. So I’m pretty optimistic that if we can scale the data, but also crucially scale and improve the representation learning, then we can get much, much better results.

LANGFORD: That’s great. Katja, your predictions for the future?

HOFMANN: Let’s see. One, three, five years? I’m very optimistic that in the shorter term—maybe I have a biased view—data from games can be very, very influential in helping us answer some of those fundamental questions on how to train and use multimodal models. The, kind of, degree to which we can collect rich data, multimodal data, in a reasonably controllable way at reasonable scale just makes these kinds of environments a prime candidate for driving insights that now, in this new world, are much more generally applicable than they might have been in the past. So I would say we’ll definitely see the benefits of that in the coming year. Within the three-year horizon, one thing that’s really interesting and also, kind of, connects to my journey as a researcher is that over the past 10 or so years, I’ve invested a lot of my effort—and my team has invested a lot—in things like reinforcement learning and decision-making. And so now with generative models, we’ve primarily seen the power of predictive model trained at massive scale. Yes, there are some RLHF [reinforcement learning from human feedback] around the edges, but I think we will see a return to some of those fundamental questions around reinforcement and decision-making settings. But building on top of the insights that we’ve gained from training these large-scale multimodal generative models. And I would say that in around three years, that’ll be back in full force, and we’ll have a lot of fun benefiting from maybe some of those, kind of, further back insights that maybe, John, you and our teams have developed over time. So I’m very much looking forward to that. Five-year horizon? You’re right. It’s hard to predict that, especially as things seem to be moving so quickly. But one thing that I would expect in that time frame is that some of the innovations that we are working on today will make their way into the hands of actual end users, and so I believe that starting to understand what that new human-AI collaboration paradigm could look like, that is something that everyone with computer access would be able to experience within the next five years.

LANGFORD: OK, excellent. Jianwei?

YANG: Yeah, honestly, I always think it’s very hard to predict the future over a five-year horizon, so I want to share an old Chinese saying: “read 10,000 books and walk 10,000 miles.” I’d say in the past maybe five years, since GPT, since the transformer era came, we have basically been striving to make the model read thousands of books, or hundreds of thousands of books, from the internet, from Wikipedia, etc. So the model itself right now, like GPT-4 or many other open-source models, has already gotten a lot of knowledge from the textbooks. It has a very good understanding of how the world operates. But the problem is that this knowledge is not grounded at all. This knowledge is great, but it is not grounded in any kind of observation, in the digital world or the physical world. In my opinion, in the next few years, and it is actually already happening now, people will try to ground this big brain, learned by reading a lot of books, onto the digital world or the physical world represented by images, by video, or by many other modalities, as Hoifung just mentioned. So I can imagine that in the next one or two years, people will really try to squeeze the knowledge out of the giant large language models to build the connection between this heavy, rich knowledge and visual observations or other types of observations. This is one thing I can imagine happening very soon. People are trying to build that connection, and people are trying to make the other parts of the model stronger and have some kind of connector in between. After that, I can imagine that this kind of progress will probably start in the digital world because, as we just mentioned, in the digital world people can obtain data very quickly; people can create a lot of data from the internet. Gradually, if we use up all of the data from the internet, we will really need to put this system into the physical world, into the environment we live in. We really want to put this model, this system, out there and let it explore the physical world, interact with the environment, and learn more knowledge. I want to echo what you just said, Katja: the gaming environment, I would say, is a really great simulator or emulator of the real environment. There are a lot of interactions between the agent and the environment, so I think this kind of information can really help the model learn more grounded knowledge about the world. Later on, we can probably really deliver this model, this system … recently, our team has been doing some research on how we can convert a multimodal language model like LLaVA or Phi-3-V into a vision-language-action model. It has already learned good knowledge about the vision and language domain, but how can we make it more capable of making decisions and planning to accomplish daily tasks, like what we do in our daily lives? I can imagine this happening very soon, in maybe a few years. I feel it’s a very exciting area for us to push forward. Also, I’m very optimistic that we will see great things in the next few years.

LANGFORD: So do you have timelines for these? What’s your expectation for how long it’ll take?

YANG: Yeah, so talking about the ultimate goal in my mind, like I just mentioned: how can we build a real AI agent that can traverse the digital world and also the physical world? We already see a lot of work in the digital world, like the systems we have built. On the other hand, in the physical world, we have yet to see a real robot that can undertake daily tasks like we do. I can imagine this will require a lot of effort on different aspects. But I’m probably a little bit more optimistic that it could happen in maybe five to 10 years, that we really can buy some kind of real robot, put it in a home, and have it help us with some household tasks, something like that. I’m probably, yeah, a little bit optimistic, yeah.

LANGFORD: So five to 10 years until we have a generative AI robot that can do useful things …

YANG: Yeah, yeah …

LANGFORD: … in your home. Excellent. All right, I think we’ve probably gone longer than they wanted us to. Thank you, everyone. It’s great to hear from you. I’ve actually learned quite a bit during this discussion.

POON: Thanks so much.

HOFMANN: I really enjoyed the discussion. Thanks so much everyone.

YANG: Yeah, thank you for the invitation.

The post Panel Discussion: Beyond Language: The future of multimodal models in healthcare, gaming, and AI appeared first on Microsoft Research.

Keynote: Phi-3-Vision: A highly capable and “small” language vision model http://approjects.co.za/?big=en-us/research/articles/keynote-phi-3-vision-a-highly-capable-and-small-language-vision-model/ Tue, 03 Sep 2024 19:02:03 +0000 http://approjects.co.za/?big=en-us/research/?post_type=msr-blog-post&p=1078626 This talk introduces Phi-3-Vision, an advanced and economical open-source multimodal model. As a member of the Phi-3 model family, Phi-3-Vision enhances language models by integrating multisensory skills, seamlessly combining language and vision capabilities.

The post Keynote: Phi-3-Vision: A highly capable and “small” language vision model appeared first on Microsoft Research.

Presented by Jianfeng Gao at Microsoft Research Forum, September 2024

Jianfeng Gao

“Microsoft’s mission is to empower every person and every organization on the planet to achieve more. If we want generative AI to be truly globally equitable—to reach everyone on the planet—we need to increase capacities while reducing costs.”

Jianfeng Gao, Distinguished Scientist and Vice President, Microsoft Research Redmond

Transcript: Keynote

Phi-3-Vision: A highly capable and “small” language vision model

Jianfeng Gao, Distinguished Scientist and Vice President, Microsoft Research Redmond

This talk introduces Phi-3-Vision, an advanced and economical open-source multimodal model. As a member of the Phi-3 model family, Phi-3-Vision enhances language models by integrating multisensory skills, seamlessly combining language and vision capabilities.

Microsoft Research Forum, September 3, 2024

JIANFENG GAO: Hi. Welcome to Microsoft Research Forum. My name is Jianfeng Gao. I’m a distinguished scientist and vice president at Microsoft Research. Today, I’m going to talk about our latest AI foundation model, Phi-3-Vision, a highly capable and cost-effective open-source vision-language model. The model seamlessly combines language and vision capabilities, and the model weights are released to the public to allow everyone to develop better models on top of it.

First, let me use a few examples to show you what the model can do. A typical use case of a vision-language model is vision question answering, where the model is asked to answer questions regarding an input image. As illustrated in this example, the question—what is the tower building in the image?—requires an understanding of language, vision, and commonsense knowledge to answer. For example, the model needs to not only recognize the tower is the Space Needle but also know that it is one of the most recognizable landmarks in the city and offers panoramic views of Seattle and the surrounding area. Compared to popular language-vision models on the market, including those released by Microsoft, such as Kosmos, LLaVA, and Florence, Phi-3-Vision is not only much smaller but has much stronger understanding and reasoning capabilities, especially in non-natural image domains, such as tables, charts, and diagrams.

As shown in this example, we presented the model a coffee shop menu, which is by no means a high-quality image, and asked such questions as, what is the price of a cappuccino with a large size, how much does it cost to add ice to the tea, and if someone wants to buy a pot of tea, how much would it cost? The model can produce correct answers by reasoning over relevant knowledge extracted from the image, such as “The price is $3.25,” “It costs an additional $1 to add ice to any tea,” and “A pot [of] tea will cost $4.” The model can also extract all the texts from the image and generate a table using the format specified by users, such as the Markdown table and the JSON representation.

Here is another example where the model is asked to generate an insightful report from a chart, and it’s told that the report will be used to make important decisions. We see that the model-generated report is very well structured. It starts with an introduction to what the chart is about, then gives four insights based on the chart, and concludes the report with a suggestion to the decision makers.

Here is how the model works. Phi-3-Vision is a language-vision model designed to process the image and a text prompt as inputs and generate text outputs. The model is composed of two primary components: a vision encoder and a language decoder. The vision encoder, which is based on the CLIP vision transformer model, extracts visual tokens from an input image. Then these visual tokens are concatenated with text tokens and are fed to the transformer language decoder, which is based on Phi-3-mini-128k model, to generate output text. The strong performance of Phi-3-Vision is mainly attributed to the use of a strong transformer language model. A language model predicts the next word based on its context. The complexity of a language model depends to a large degree upon the length of the context it can encode. Encoding longer context often leads to a better model. As in this example, the model needs to encode a long context to include the word “dog” to predict the next word, “barking.” Language modeling is a long-standing research topic dating back to the 1950s [with] Shannon’s application of information theory to human language, where he measured how well simple N-gram language models predict natural language text. However, these N-gram models can only handle very short context because the model size grows exponentially with context lengths.
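Below is a schematic of the data flow just described, assuming a vision encoder that returns patch features and a causal language decoder that accepts precomputed token embeddings. It is a sketch of the general vision-language recipe, not the released Phi-3-Vision implementation; the module names and shapes are placeholders.

```python
import torch
import torch.nn as nn

class VisionLanguageModelSketch(nn.Module):
    """Sketch of the described pipeline: image -> visual tokens -> concatenate with text -> decoder."""
    def __init__(self, vision_encoder, language_decoder, d_vision, d_model):
        super().__init__()
        self.vision_encoder = vision_encoder           # assumed: CLIP-style ViT returning (B, Lv, d_vision)
        self.projector = nn.Linear(d_vision, d_model)  # map visual features into the decoder's token space
        self.language_decoder = language_decoder       # assumed: causal LM that accepts token embeddings

    def forward(self, image, text_token_embeddings):
        visual_tokens = self.projector(self.vision_encoder(image))          # (B, Lv, d_model)
        tokens = torch.cat([visual_tokens, text_token_embeddings], dim=1)   # visual tokens precede text tokens
        return self.language_decoder(tokens)                                # decoder generates text autoregressively
```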

Traditional neural language models such as recurrent neural networks compress the context into a fixed-size vector to capture long context while keeping the computational cost manageable. In contrast, transformers can effectively encode very long, uncompressed text via self-attention. This is why transformer models are so successful. Recently, sparse attention mechanisms have been explored to deal with the quadratic complexity of self-attention as models take increasingly long input token sequences, as we will talk about in a minute.
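For reference, the quadratic cost mentioned here comes from the standard scaled dot-product attention; in the usual notation (not taken from the talk):

```latex
% Standard scaled dot-product self-attention over n tokens of width d:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,
\qquad Q, K, V \in \mathbb{R}^{n \times d}
% The QK^T product costs O(n^2 d) time and O(n^2) memory in n, which is why very
% long inputs (for example, many visual tokens) motivate sparse attention,
% whereas an RNN compresses the entire history into a fixed-size state.
```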

Scaling laws suggest that we can keep improving the performance of transformer language models by increasing model size and the training data. As a result, we have witnessed the emergence of many very large language models, such as GPT-4. These large language models show emergent abilities, such as in-context learning, where the model learns to perform new tasks given only a few demonstrations without additional model training. These abilities make larger language models the building block of general-purpose AI systems way beyond language understanding and generation. However, these scaling laws assume a “fixed” data source. This assumption is now significantly disrupted by the existence of frontier language models themselves, which allow us to interact with data in novel ways. For example, it has been reported that a combination of large language model–based filtering of web data and large language model–created synthetic data enables model abilities in smaller language models that were typically seen only in much larger models, such as in-context learning.

This inspired Microsoft to develop a family of small language models called the Phi-3 models. These models are highly capable in the ways that larger language models are but are far more cost effective. As shown in this figure, the Phi-3 language models are the best performers across the quality and cost curve. For example, Phi-3-mini outperforms models twice its size, including Llama 3 and Mistral 7B. The Phi-3-Vision model uses Phi-3-mini as its language decoder, and a vision encoder extracts vision tokens from the input image. To encode the extremely long context that results from the large number of vision tokens extracted from high-resolution images, our transformer-based vision encoder uses a sparse attention mechanism based on dynamic cropping.

In this example, we split an input image into 2D attention blocks and build for each block a local attention map by computing attention scores only within the block. To encode dependencies among tokens in different blocks, we resize the high-resolution input image into a low-resolution image so that all visual tokens can fit in one attention block and build a global attention map for the whole input image, although using its coarse-grained version.
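Here is a rough sketch of how such a local-plus-global token layout could be assembled: each crop is encoded independently (local attention blocks), and a downscaled copy of the whole image supplies the global block. The crop size, padding, and encoder are illustrative assumptions; the actual Phi-3-Vision cropping and attention details may differ from this simplification.

```python
import torch
import torch.nn.functional as F

def dynamic_crop_tokens(image, encoder, crop_size=336, global_size=336):
    """Sketch of the described scheme: local tokens per crop plus global tokens from a resized image.

    image:   (B, C, H, W) high-resolution input
    encoder: callable mapping a (B, C, crop_size, crop_size) image to (B, L, d) visual tokens
    """
    B, C, H, W = image.shape
    local_tokens = []
    # Local attention blocks: encode each crop independently, so attention never
    # spans more than one crop and cost grows linearly with the number of crops.
    for top in range(0, H, crop_size):
        for left in range(0, W, crop_size):
            crop = image[:, :, top:top + crop_size, left:left + crop_size]
            crop = F.pad(crop, (0, crop_size - crop.shape[-1], 0, crop_size - crop.shape[-2]))
            local_tokens.append(encoder(crop))
    # Global attention block: a coarse, downscaled view of the whole image lets
    # tokens from different crops share information at low resolution.
    global_view = F.interpolate(image, size=(global_size, global_size),
                                mode="bilinear", align_corners=False)
    global_tokens = encoder(global_view)
    return torch.cat(local_tokens + [global_tokens], dim=1)   # (B, total_tokens, d)
```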

The model is trained in two phases: pretraining and post-training. In the pretraining phase, the model acquires general skills for vision-language understanding and reasoning. Phi-3-Vision is pretrained on a diverse dataset of approximately 100 million text-image pairs extracted from web documents, synthesized from OCR of PDF files, and drawn from datasets for chart and table comprehension. The post-training phase consists of two stages: supervised fine-tuning and direct preference optimization. Supervised fine-tuning, or SFT, enhances the model’s ability to follow human instructions to solve downstream tasks. The training data we used consists of 15 billion tokens and is a combination of multimodal instruction-tuning data that covers diverse domains and tasks, such as understanding and reasoning over natural images, like the Space Needle picture I described before, as well as non-natural images, such as charts, tables, and diagrams. Direct preference optimization, or DPO, improves model safety by aligning model outputs with human preference. We used a highly selective preference dataset, which consists of triples. Each triple contains a prompt, a human-chosen answer to the prompt, and a rejected answer. The model is trained to always prefer the chosen answer to the rejected answer.
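The DPO objective referred to here is usually written as a logistic loss over the (prompt, chosen, rejected) triples, comparing the model being trained against a frozen reference model. The sketch below is the generic published form of the loss, not Microsoft’s training code, and the log-probability inputs are assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Direct Preference Optimization loss over (prompt, chosen, rejected) triples.

    Each argument is the summed log-probability of the chosen or rejected answer
    under the model being trained ("policy") or the frozen reference model.
    The loss pushes the policy to prefer the chosen answer more strongly than
    the reference model does.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```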

We evaluate the performance of Phi-3-Vision on AI benchmarks in three categories: science, charts, and generic knowledge. We see that Phi-3-Vision significantly outperforms all the other open-source models, which have much larger model size, on almost all the benchmarks. Compared to the best closed-source models, such as GPT-4V, although there’s still a performance gap on the generic knowledge benchmarks, such as MMMU, on many science question answering and chart reasoning tasks, the performance of Phi-3-Vision is better despite its much smaller model size.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. If we want generative AI to be truly globally equitable—to reach everyone on the planet—we need to increase capacities while reducing costs. Given the popularity of models like GPT-4 and their adoption at massive scale, reducing costs is a very important part of achieving this mission. Phi-3-Vision is the first multimodal model in the Phi small model family. It matches and sometimes exceeds some of the capabilities of much larger models, such as GPT-4V, at a much lower cost. And to help everyone build more affordable and accessible AI systems, we have released the model weights into the open-source community. In future work, we will extend the model to have new abilities such as action planning for embodied AI and robotics, where cost effectiveness is particularly important. If you want to learn more about this exciting field, keep watching for the panel discussion hosted by my colleague John Langford. Thank you for joining us today.

The post Keynote: Phi-3-Vision: A highly capable and “small” language vision model appeared first on Microsoft Research.
