
Research Forum Brief | February 2025

Keynote: Multimodal generative AI for precision health



“Using Microsoft AI, Providence researchers were able to find actionable biomarkers for a majority of patients and, consequently, many patients were prescribed precision therapy, which substantially increased overall survival.”

Hoifung Poon, General Manager, Microsoft Health Futures

Transcript: Keynote

Multimodal generative AI for precision health

Hoifung Poon, General Manager, Microsoft Health Futures

This talk introduces an agenda in precision health, utilizing generative AI to pretrain high-fidelity patient embeddings from multimodal, longitudinal patient journeys. This approach unlocks population-scale real-world evidence, optimizing clinical care and accelerating biomedical discovery.

Microsoft Research Forum, February 25, 2025

WILL GUYMAN, Group Product Manager, Healthcare AI Models: It is my pleasure to introduce my colleague Hoifung Poon, an expert in healthcare AI and general manager of Microsoft Health Futures, to talk about utilizing generative AI to enable precision healthcare. In addition to advancing the frontier of medical AI, Hoifung and Microsoft Research have deeply invested in bridging the gap between research and our clinical partners across the ecosystem. I always leave inspired after hearing Hoifung talk, and I’m sure you’ll feel the same. Over to you, Hoifung. 

HOIFUNG POON: Hi, everyone. My name is Hoifung Poon. I am general manager at Microsoft Health Futures. I lead Biomedical AI Research and Incubation for precision health, with a particular focus on advancing multimodal gen-AI [generative AI] to unlock population-scale real-world evidence. So, in the ideal world, we want every patient to be able to respond to the treatment they have been prescribed, as signified by the blue person here on the graph on the left.

In the real world, unfortunately, many patients do not respond to the treatment, as signified by the red person here. So this is obviously the fundamental challenge in biomedicine, and cancer is really the poster child of this problem. For example, immunotherapy is the cutting edge of cancer treatment. And, indeed, blockbuster drugs such as Keytruda can work miracles on some of the late-stage cancer patients.

However, the overall response rates still hover around 20 to 30%. Now when the standard of care fails, which is often the case for cancer, clinical trials become the last hope. Here’s Martin Tenenbaum, a successful AI researcher and e-commerce entrepreneur. At the peak of his career, Marty was diagnosed with late-stage melanoma. But fortunately for Marty, he was able to mobilize his network to find a matching trial that cured his cancer.

However, most patients are not as lucky or resourceful as Marty. Even in the US, only a small portion of patients are able to find matching trials, whereas a lot of cancer trials fail simply because they can’t find enough patients. Developing a new drug is notoriously hard, taking billions of dollars and over a decade. And this will become increasingly unsustainable in precision health as we actually have to develop more drugs, each applicable to smaller subpopulations.

When we think about drug development, oftentimes the first thing that comes to mind is early discovery. Now, this is indeed super exciting and foundational, but in the grand scheme of things it’s only 10 to 20% of the total cost. Most of the astronomical cost in drug development actually stems from the later stages of clinical trials and post-market. Interestingly, this also happens to be the lowest-hanging area, with immediate opportunities for major disruption.

For example, a single phase-three cancer trial can cost hundreds of millions of dollars, and we only get back a few thousand data points. And the whole process is so inefficient. But there is a lot of potential in actually changing this by harnessing AI to unlock population-scale real-world evidence. In the past couple of decades, there has been rapid digitization of patient records.

And every day there are literally billions and billions of data points collected in routine clinical care about a patient journey from diagnosis to treatment to outcome. At the beginning of a patient journey, even the best doctor doesn’t have a perfect crystal ball on what might happen next. So, each journey is essentially a mini-trial, and each encounter brings forth new information.

If we can crack the code and unlock the insight underneath, this is essentially a population-scale free lunch. So, for example, here is the de-identified journey of a cancer patient, where each bar represents clinical notes. So, you can see there are many, many note types, and each note contains a lot of detailed information about the patient journey.

Additionally, there are a lot of information-rich modalities, from medical imaging to multi-omics. So, each of these modalities is trying to tell us something about the patient, but each is inherently limited. Only by assimilating all of these modalities can we recapitulate a holistic patient representation. So, from a machine learning point of view, precision health amounts to learning a function that inputs a multimodal patient journey and then outputs key medical events, such as disease progression and counterfactual tumor response.

If we can predict them well, we have essentially solved precision health. Now, of course, as you can guess, this is not so easy, right? So, a patient journey is not just a snapshot, but actually a longitudinal time series. More annoyingly, most of the information that we want to have is actually missing, and even the observable information can be very noisy and also contain a lot of biases.

But this is exactly why gen-AI can be so promising for precision health. The underpinning of gen-AI is a generative model over the joint distribution of all those clinical variables. So this enables us to compress all the observable information into a patient embedding, which can then help predict the missing information. And then predicting the next medical event is essentially a special case.

So, our overarching agenda essentially lies in how we can harness those population-scale, real-world data to pretrain a high-fidelity patient embedding that can serve as a digital twin for the patient. And, given the patient embedding, we can then conduct patient reasoning at the population scale. For example, after a cancer diagnosis, instead of spending months and tons of resources to seek a second opinion, we can essentially snap a finger to get millions of opinions from the most similar patients.

We can interrogate the patient journey, such as a treatment pathway and longitudinal outcome. And this can immediately help improve patient care. We can also compare non-responder versus exceptional responder and start probing mysteries, such as why those 80% of patients do not respond to Keytruda. And in this way we can essentially unlock all those emerging capabilities from the population-scale real-world evidence that actually allow us to shatter the glass ceiling of today’s healthcare common sense.
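
To make this kind of population-scale reasoning concrete, here is a minimal sketch of that similar-patient lookup, assuming each patient journey has already been compressed into an embedding; the NumPy code, function name, and dimensions below are illustrative, not the actual system.

```python
# Illustrative sketch only: with each patient journey compressed into an embedding,
# "millions of opinions" becomes a nearest-neighbor search in embedding space.
import numpy as np

def most_similar_patients(query: np.ndarray, cohort: np.ndarray, k: int = 5):
    """query: (dim,) embedding of the new patient; cohort: (n_patients, dim)."""
    # Cosine similarity between the query and every patient in the cohort.
    q = query / np.linalg.norm(query)
    c = cohort / np.linalg.norm(cohort, axis=1, keepdims=True)
    scores = c @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Hypothetical usage: retrieve the k most similar journeys, then inspect their
# treatment pathways and longitudinal outcomes.
cohort_embeddings = np.random.randn(100_000, 256).astype(np.float32)
new_patient = np.random.randn(256).astype(np.float32)
indices, similarities = most_similar_patients(new_patient, cohort_embeddings, k=5)
```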

So this is very, very exciting, but the forward path is incredibly challenging. Even the best frontier models have a major competency gap for an ever-growing, long list of non-text modalities in biomedicine. So, over the past decade or so, we have blazed a new trail by essentially conducting curriculum learning over three giant free lunches.

The first free lunch stems from unimodal data, such as pathology images. So, here a general recipe for self-supervision lies in pre-training modality-specific encoders and decoders that can compress the input into an embedding and then decompress it back to reproduce the original input. For text, we can simply piggyback on existing frontier models that are already very, very good at understanding and reasoning with biomedical text.
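
As a rough sketch of this compress-and-reconstruct recipe (the architecture, dimensions, and reconstruction loss below are illustrative assumptions, not the actual pre-training code):

```python
# Minimal sketch (not the production recipe): a modality-specific autoencoder
# pretrained by reconstructing its own input. All names and sizes are illustrative.
import torch
import torch.nn as nn

class UnimodalAutoencoder(nn.Module):
    def __init__(self, input_dim: int, embed_dim: int = 256):
        super().__init__()
        # Encoder compresses the raw input into a compact embedding.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 1024), nn.GELU(), nn.Linear(1024, embed_dim)
        )
        # Decoder decompresses the embedding back toward the original input.
        self.decoder = nn.Sequential(
            nn.Linear(embed_dim, 1024), nn.GELU(), nn.Linear(1024, input_dim)
        )

    def forward(self, x: torch.Tensor):
        z = self.encoder(x)
        return z, self.decoder(z)

# Self-supervised objective: reconstruction loss, no labels required.
model = UnimodalAutoencoder(input_dim=4096)
x = torch.randn(8, 4096)                      # a batch of flattened image patches
embedding, reconstruction = model(x)
loss = nn.functional.mse_loss(reconstruction, x)
loss.backward()
```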

Now, this general recipe is universally applicable and very powerful in biomedicine, but there is also a whole slew of modality-specific challenges that require major research innovations. For example, digital pathology is well known to contain a lot of key information about tumor microenvironments, such as how immune cells interact with cancer cells, which is crucial for deciphering resistance to immunotherapy.

So here, the transformer is the workhorse of gen-AI and in theory is perfect for modeling such complex global context. However, pathology slides are among the largest images in the world. A single whole-slide image can be hundreds of thousands of times larger than a standard web image, which means it would require billions of times more computation due to the quadratic growth of transformer attention. So, to address this problem, a promising direction is to incorporate an idea called dilated attention, which originated in speech recognition, where modeling long contexts is also a big problem. For images, the transformer essentially works by having pixels pass messages to each other, which is why it leads to the quadratic growth in compute.

So in dilated attention, for smaller blocks in local neighborhoods, we still use full self-attention with pairwise message passing. But when we pass messages across larger blocks, we instead elect representatives for the local neighborhoods and then only pass messages among those representatives. For larger and larger blocks, we elect sparser and sparser representatives, and in this way we can perfectly cancel out the quadratic growth.
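
Here is a simplified sketch of that idea in the spirit of LongNet-style dilated attention; the segment sizes, dilation rates, and the way the local and global branches are combined are illustrative assumptions rather than the exact GigaPath implementation.

```python
# Illustrative sketch of the dilated-attention idea: full attention inside small
# local blocks, and progressively sparser "representative" tokens exchanging
# messages across larger blocks.
import torch
import torch.nn.functional as F

def dilated_attention(q, k, v, segment_len: int, dilation: int):
    """q, k, v: (batch, seq_len, dim). seq_len must be divisible by segment_len."""
    b, n, d = q.shape
    out = torch.zeros_like(q)
    for start in range(0, n, segment_len):
        # Within this segment, keep only every `dilation`-th token as a representative.
        idx = torch.arange(start, start + segment_len, dilation)
        qs, ks, vs = q[:, idx], k[:, idx], v[:, idx]
        # Full (quadratic) attention, but only among the representatives,
        # so cost per segment shrinks by a factor of dilation**2.
        attn = F.softmax(qs @ ks.transpose(-2, -1) / d ** 0.5, dim=-1)
        out[:, idx] = attn @ vs
    return out

# Small local blocks use dilation=1 (dense attention); larger blocks use larger
# dilations, keeping total compute roughly linear in sequence length.
q = k = v = torch.randn(2, 1024, 64)
local_out  = dilated_attention(q, k, v, segment_len=64,   dilation=1)
global_out = dilated_attention(q, k, v, segment_len=1024, dilation=16)
out = local_out + global_out   # in practice, the branches are mixed with learned weights
```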

So, by adapting dilated attention to digital pathology, and in collaboration with Providence Health System and the University of Washington, we have created GigaPath, the world’s first digital pathology foundation model that can truly scale the transformer to whole-slide images. This paper was published in Nature last year, and we are very excited to see that, in the few months since its publication, GigaPath has already been downloaded well over half a million times across the globe.

We are super psyched to see the community’s interest. And we have also made a ton of progress in other modalities, such as CT and spatial multi-omics. So, the unimodal pre-training is a very good first step, but there are even bigger challenges. So, for example, a pathology foundation model may learn to map a tumor lesion somewhere in the embedding space, whereas a CT foundation model may map it elsewhere.

Each modality is trying to tell us something about a patient, but each is speaking its own distinct language. So, this is essentially analogous to the translation problem for human languages. And in the translation space, right, to deal with the multilingual explosion, machine translation systems will usually introduce a resource-rich language, such as English, as an interlingua to bridge among those low-resource languages.

For example, there may not be any parallel data between a language in Africa and a language in India, but we can translate from the African language to English and then from English to the Indian language. And this is, indeed, how commercial machine translation systems scale to hundreds of languages around the world. So, here we propose to follow the same recipe in dealing with the multimodal complexity in biomedicine by introducing an interlingua modality, and text is an ideal candidate to serve as this interlingua.

We already have very powerful frontier models for the biomedical text modality. Moreover, for any non-text modality under the sun, the study of that modality involves natural language, which means that there are a lot of readily available modality-text pairs, such as a pathology slide and the corresponding pathology report. We can piggyback on the unimodal pre-training in the first stage by reusing those encoders and decoders, and then focus on using the modality-text pairs to pre-train a lightweight adapter layer. The adapter layer essentially translates from the modality embedding to the text-semantic space. So, this enables all the modalities to start to speak the same language, and it also helps propagate a lot of the rich prior knowledge that has already been captured in the text-semantic space back to the individual modalities to help with their interpretation.
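
To make the adapter idea concrete, here is a minimal LLaVA-style sketch; the layer sizes, the frozen-encoder setup, and the stand-in regression loss are illustrative assumptions (the real recipe trains the adapter against the language model’s loss on the paired text).

```python
# Hedged sketch of the adapter idea, with illustrative names only: the unimodal
# encoder is frozen, and a small adapter maps its embeddings into the text
# model's semantic space, trained on modality-text pairs.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Lightweight projection from a modality embedding to the text-semantic space."""
    def __init__(self, modality_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(modality_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim)
        )

    def forward(self, modality_embedding: torch.Tensor) -> torch.Tensor:
        return self.proj(modality_embedding)

# Training sketch: only the adapter's parameters are updated; the pathology/CT/etc.
# encoder and the text model stay frozen.
adapter = ModalityAdapter(modality_dim=1536, text_dim=4096)
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

image_embedding = torch.randn(4, 1536)        # from a frozen unimodal encoder
target_text_embedding = torch.randn(4, 4096)  # e.g., embedding of the paired report

pred = adapter(image_embedding)
loss = nn.functional.mse_loss(pred, target_text_embedding)  # stand-in objective
loss.backward()
optimizer.step()
```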

So, for more detail about this general recipe, you can check out our LLaVA-Med paper, which was spotlighted at NeurIPS. Here I also want to add that the LLaVA paradigm represents a trailblazing innovation at MSR [Microsoft Research], harnessing the text-processing capability of frontier models to synthesize multimodal instruction-following data. This has since become standard practice, including for training multimodal Phi and other popular vision-language models.

Now, we can extend this recipe to include a pixel-level decoder for holistic image analysis. So, this enables us to develop BiomedParse, which can conduct object recognition, detection, and segmentation in one fell swoop through a unified natural-language interface. So, you can essentially talk to the image to conduct an analysis. So, BiomedParse is a single foundation model that can attain state-of-the-art performance across nine modalities and six major object types.

It was just published in Nature Methods and, in the same issue, Nature Methods also published an external review that called BiomedParse a groundbreaking biomedical AI foundation model and said that the implications of BiomedParse are profound. So these are all very, very exciting, but we still have one last giant free lunch, which lies in the patient journeys themselves.

So, recall that GPT essentially learns by predicting the next token, next token, next token, right? In the same way, our patient embedding can learn by predicting the next medical event, next medical event. So, in this way, we can essentially turn every single patient journey into a self-supervised training instance. We have conducted some initial explorations on structured medical events using a public dataset.
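
As an illustrative sketch of that next-event analogy (the event vocabulary, the tiny transformer, and the training step below are assumptions for exposition, not the actual study):

```python
# Sketch of the "next medical event" analogy to next-token prediction; the event
# codes and the tiny model here are illustrative, not the actual system.
import torch
import torch.nn as nn

class NextEventModel(nn.Module):
    def __init__(self, num_event_types: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(num_event_types, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_event_types)

    def forward(self, events: torch.Tensor) -> torch.Tensor:
        # events: (batch, time) integer codes for diagnoses, treatments, outcomes, ...
        n = events.size(1)
        # Causal mask so each position only sees earlier events in the journey.
        causal_mask = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.backbone(self.embed(events), mask=causal_mask)
        return self.head(h)  # logits for the next event at every time step

# Every patient journey becomes a self-supervised training instance:
# predict event t+1 from events 1..t, exactly like next-token prediction.
journeys = torch.randint(0, 1000, (16, 64))  # 16 journeys, 64 events each
logits = NextEventModel(num_event_types=1000)(journeys)
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 1000), journeys[:, 1:].reshape(-1)
)
loss.backward()
```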

Interestingly, the scaling laws established for text turn out not to be very far off for structured medical events, and we are now extending the study to much larger datasets. So, ultimately, we can imagine embeddings not just for patients, but also for interventions, for clinical trials, and so on. In this way, we can potentially develop a universal embedding calculus for precision health.

As we mentioned earlier, clinical trial matching is the perfect beachhead. Providence is the third-largest health system in the US, and they have been using our research system daily in their tumor board to screen thousands of patients a year, including for this high-profile trial featured in The New York Times. Using Microsoft AI, Providence researchers were able to find actionable biomarkers for a majority of patients and, consequently, many patients were prescribed precision therapy, which substantially increased overall survival.

So this is super, super exciting. Ultimately, the dream is to drastically scale high-quality health care and drastically reduce healthcare costs, and thereby democratize high-quality health care for essentially everyone. With the clinical trial matching capability, we can also essentially snap a finger to construct a virtual case arm and control arm, and then conduct clinical research, hypothesis generation, and testing using real-world data.

A lot of those marquee lung cancer trials that can cost hundreds of millions of dollars to run can be simulated using real-world data, as we have shown with our Providence collaborators, including for the original Keytruda trial.

Now, obviously, with exciting moonshots such as precision health, it takes way more than a village. At Microsoft Research, we are super blessed to be able to collaborate in depth with talented researchers across Microsoft Research itself, as well as with academia and with a lot of key health stakeholders, such as large health systems and life-sciences companies. Many of the frontier biomedical models we have highlighted in this talk are already publicly available in Azure AI Foundry.

Now, obviously, much more remains to be done. But even with what we have today, there is already a lot that we can bring forth in positive disruption to scale drug development and improve patient care.

Thank you.
