
Research Forum Brief | September 2024

Panel Discussion: Beyond Language: The future of multimodal models in healthcare, gaming, and AI


Katja Hofmann

“I believe that starting to understand what that new human-AI collaboration paradigm could look like, that is something that everyone with computer access would be able to experience within the next five years.”

Katja Hofmann, Senior Principal Researcher, Microsoft Research

Transcript: Panel Discussion

Beyond Language: The future of multimodal models in healthcare, gaming, and AI

Katja Hofmann, Senior Principal Researcher, Microsoft Research
Jianwei Yang, Principal Researcher, Microsoft Research Redmond
Hoifung Poon, General Manager, Microsoft Research Health Futures
John Langford (host), Partner Research Manager, Microsoft Research AI Frontiers

This discussion delves into the transformative potential and core challenges of multimodal models across various domains, including precision health, game intelligence, and foundation models. Microsoft researchers share their thoughts on future directions, bridging gaps, and fostering synergies within the field.

Microsoft Research Forum, September 3, 2024

JOHN LANGFORD: Hello, everyone. I’m John Langford. I’m joined by Katja Hofmann, Hoifung Poon, and Jianwei Yang, each of whom is working on multimodal models of actually quite different varieties. The topic that we’re going to be thinking about is multimodal models and where the future is. I guess I’d like to start with what do you see as, kind of, the key benefits and uses of a multimodal model. Maybe we’ll start with Hoifung. Give us a background of where you are at with multimodal models, what you’re working with them on, and where you see them really shining.

HOIFUNG POON: Thanks, John. Very excited to be here. As, John, you mentioned, one of the really, sort of, like, really exciting frontiers is to advance multimodal generative AI, and for us, a particularly exciting area is to apply this to precision health. And so … cancer is really, sort of, the poster child for precision health, right. So, for example, one of the really cutting-edge treatments for cancer these days is immunotherapy; that works by mobilizing the immune system to fight the cancer. And one of the blockbuster drugs is KEYTRUDA, which really can work miracles for some late-stage cancers, and the annual revenue is actually above $20 billion. Unfortunately, only 20 to 30 percent of the patients actually respond. So that’s really, sort of, like, a marquee example of what the growth opportunities are in precision health. If we look back at the past couple of decades, one of the really exciting things happening in biomedicine is the rapid digitization of patient data. That goes from medical imaging to genomics to medical records, clinical notes, and so forth. So every day, if you look around, there are literally billions and billions of data points collected in, sort of, like, routine clinical care that document this very high-dimensional patient journey from diagnosis to treatment to outcome. For example, a cancer patient may have hundreds of notes, but also, crucially, there will be information from, like, radiology imaging, from CT to MRI and so forth, and by the time the cancer patient gets a biopsy or resection, then you will also get digital pathology, you’ll get genomics, and so forth. So one of the really exciting opportunities is that all these, kind of, modalities are trying to tell you something about the patient, right, but each of them is very limited in its own right. So I like to liken it to, sort of, like, blind folks touching the elephant, right. Each modality gives you one piece of the elephant, and only by, sort of, kind of, like, combining all those modalities can we recapitulate, sort of, like, the holistic representation of the patient. So that’s, sort of, like, what we see as the most exciting opportunity: can we learn from real-world data at the population scale to be able to train very powerful, sort of, like, frontier biomedical models that can create this kind of multimodal patient embedding that synthesizes the multimodal longitudinal journey of a patient to essentially serve as a digital twin? Then you can start to reason about it, to find patients like me, right, at population scale, to figure out what works, what doesn’t work, and so forth. We are starting to actually see some promising, kind of, proof points by working with large health systems and pharmaceutical companies, clinical researchers, and so forth.
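Hoifung does not spell out an architecture here, but one way to picture the multimodal patient embedding he describes is a small fusion model that takes the outputs of modality-specific encoders (imaging, pathology, genomics, notes) and combines them into a single vector. The sketch below is purely illustrative; the module names, dimensions, and the choice of a transformer fuser are assumptions, not a description of the Microsoft Research system.

```python
# Minimal sketch (not the actual system): fuse per-modality embeddings from a
# patient's record into one patient embedding. All names/sizes are assumptions.
import torch
import torch.nn as nn

class PatientFusion(nn.Module):
    def __init__(self, d_model=256, n_modalities=4, n_layers=2):
        super().__init__()
        # One learned projection per modality (e.g., imaging, pathology, genomics,
        # notes), assuming a modality-specific encoder already produced a vector.
        self.proj = nn.ModuleList([nn.LazyLinear(d_model) for _ in range(n_modalities)])
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fuser = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls = nn.Parameter(torch.randn(1, 1, d_model))  # summary token

    def forward(self, modality_embeddings):
        # modality_embeddings: list of (batch, d_i) tensors, one per modality.
        tokens = [p(e).unsqueeze(1) for p, e in zip(self.proj, modality_embeddings)]
        batch = tokens[0].size(0)
        seq = torch.cat([self.cls.expand(batch, -1, -1)] + tokens, dim=1)
        return self.fuser(seq)[:, 0]  # the summary position is the patient embedding

# Toy usage: four modality vectors of different widths for a batch of 2 patients.
model = PatientFusion()
patient_vec = model([torch.randn(2, d) for d in (512, 768, 128, 384)])
print(patient_vec.shape)  # torch.Size([2, 256])
```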

LANGFORD: So fusing different modalities to get a holistic picture of a patient in order to analyze what kind of treatments may work and so forth …

POON: Precisely. The very minimal thing is that you can start leveraging that very high-fidelity patient embedding. Like, for example, today for cancer patients, people will like [say], let’s go find a second opinion, right. With this at the population scale, you can get 5 million second opinions, right, to find all the patients like this person, and now you can interrogate what are the treatments that people have tried, right, and what actually works? What doesn’t work? Now you can start to, you know, make better decisions, so there’s an immediate benefit, but more importantly, you can start to also … like, for example, in the KEYTRUDA case, you can start to interrogate, who are the exceptional responders versus those 70 percent, 80 percent non-responders? How are they different, right? That would give you a lot of clues about, sort of, like, why the existing drugs and targets don’t work, and that could potentially drastically accelerate, kind of, discovery.
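The "5 million second opinions" idea is, at its core, nearest-neighbor search over patient embeddings. Below is a toy sketch under the assumption that embeddings like the ones above already exist; the population size, dimensionality, and random vectors are placeholders.

```python
# Minimal sketch: find "patients like me" by cosine similarity over precomputed
# patient embeddings. Purely illustrative; all data here is random.
import numpy as np

rng = np.random.default_rng(0)
# Toy "population" of precomputed patient embeddings (population scale in practice).
population = rng.standard_normal((100_000, 256)).astype(np.float32)
population /= np.linalg.norm(population, axis=1, keepdims=True)

def patients_like_me(query_vec, k=100):
    """Return indices of the k most similar patients by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = population @ q
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

neighbors = patients_like_me(rng.standard_normal(256).astype(np.float32))
# Downstream, one would aggregate the treatments and outcomes recorded for
# `neighbors` to ask: what worked for patients like this one?
```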

LANGFORD: All right, thank you, Hoifung. Katja, do you want to tell us a little bit about your multimodal models?

KATJA HOFMANN: Sure. Interestingly, the kinds of applications in the space we’ve looked at in my theme are very, very different from those applications in precision health that Hoifung just mentioned. We have looked at one of these fundamentally human activities of creative ideation, and we’ve been exploring the potential of generative models for this in the context of game creation. So coming up with new ideas for video games is something that, of course, people are doing on a very regular basis. There are 3 billion players on the planet that rely on getting very diverse, engaging content in order to create these really immersive or connecting experiences. And what we’ve seen looking at generative models for this is that, one, there is a huge potential for that, but at the same time, we still need to push on capabilities of these generative models, for example, in order to support divergent thinking or to allow people to iterate and really control the kinds of things that these models are able to produce. In my team, we have focused on initial models that are multimodal in the sense of modeling both the visuals of what a player might see on the screen as well as the control actions that a player might issue in response to what’s on the screen. So those are the two modalities that we have so far. And with the models that we have trained in the space, we see that they have this amazing capability of picking up, for example, an understanding of the underlying game mechanics, how different characters interact with each other, and they also provide some amount of ability to be modified by their users. So, for example, creatives could inject new characters or new components and then just work with this material to envision how different gameplay situations might work out. So I see a lot of potential for models like this to support creative ideation in many different application domains. Over time, we will understand how we can add additional modalities to this kind of material for creators. And I think we’re only at the beginning of exploring what’s possible and what new kinds of things people will come up [with] when they have this new material at their disposal.
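Katja does not go into architectural detail here, but the two modalities she names (screen visuals and controller actions) can be pictured as one interleaved token sequence modeled autoregressively. The sketch below is an illustrative assumption only, not her team's model; the vocabulary sizes, tokenizer, and layer counts are invented.

```python
# Minimal sketch: an autoregressive model over interleaved frame tokens and
# action tokens, in the spirit of "visuals + control actions" modeling.
import torch
import torch.nn as nn

N_FRAME_TOKENS, N_ACTIONS, D = 1024, 16, 256   # invented vocabulary sizes

class GameplayModel(nn.Module):
    """Predicts the next token in a mixed sequence of frame and action tokens."""
    def __init__(self, context_len=128):
        super().__init__()
        self.embed = nn.Embedding(N_FRAME_TOKENS + N_ACTIONS, D)  # shared vocabulary
        self.pos = nn.Embedding(context_len, D)
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D, N_FRAME_TOKENS + N_ACTIONS)

    def forward(self, tokens):                      # tokens: (batch, seq) of token IDs
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.backbone(x, mask=causal)           # each step sees only the past
        return self.head(x)                         # logits over next frame/action token

# A batch of 2 toy clips, each a 64-token mix of frame and action IDs.
seq = torch.randint(0, N_FRAME_TOKENS + N_ACTIONS, (2, 64))
logits = GameplayModel()(seq)                       # shape (2, 64, 1040)
```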

LANGFORD: So just relating to Hoifung’s, it seems like Hoifung probably deals with a lot more data access– and incomplete data–type issues than you may be dealing with, but then there’s also, kind of, a creative dimension of things never tried before, which you would like to address that maybe is not so important in Hoifung’s case.

HOFMANN: That’s a really good reflection. I think this already points to both some of the key challenges and opportunities in the space. One opportunity is just the fact that Hoifung’s work and mine are, in many ways, so similar; it’s really quite striking. We’ve seen a, kind of, confluence of models in the sense of we have a really, really good understanding of some of the things that work really well, especially when we have the right data and when we know how to scale them up, but there are also some fundamental questions on, how do we deal with partial, incomplete data? How do we curate and create datasets? How do we understand the impact of the variety, quality, and scale of data that we have at our disposal on the ultimate quality of the models that we can create? In our case, like you say, we build on, kind of, the rich data that we can obtain from game environments. And that also means that we can inform some of this research. We can build insights on how to best use this kind of data that might actually benefit Hoifung and his team in building better models, ultimately for precision medicine, which I find incredibly exciting. And then there are dimensions where we really look for very different things. Precision and, of course, accurate results are very, very important for any application in the health space. Whereas in our case, for example, capturing the full diversity of what might be possible in a given game situation or pushing on boundaries, creatively recombining elements, is something that we might be looking for that may be much less desirable in other applications.

LANGFORD: Thank you. Jianwei, can you tell us about your multimodal models?

JIANWEI YANG: Yeah, yeah. Yeah, so hi, everyone. I’m very glad to be here to discuss multimodal models. So my background is more from computer vision. I started my computer vision journey roughly maybe 10 years ago. Now, actually, we are mainly talking about the multimodal model, so typically, actually, the multimodal model covers, for example, vision and language and how we can combine them together. In my opinion, I think at a high level, this kind of multimodal model actually can really, OK, help in a lot of, kind of, applications or a lot of, kind of, scenarios in that it can really help to capture the world around us. So it has the visual input, and it can capture what kind of objects and what kind of relationships and what kind of actions are in the image or videos, etc. On the other hand, actually, this kind of multimodal model, by connecting vision and language, can really help the model have the communication capability with humans, so that humans can really have a conversation and can chat with the model and then prompt the model to really finish some task that the human or the user requires it to do. So overall, I feel there are a bunch of applications in these kinds of multimodal scenarios. So from my point of view, at a high level, it can be applied not only to the digital world, as Hoifung and Katja just mentioned, in their health and gaming scenarios, but also to the physical world, right. So if we really have a multimodal system or AI agent that can really, OK, understand the whole environment, the physical world, and then have a very good communication capability, actually it can be deployed to, for example, an autonomous driving system or even a real robot, right, so that we can really have a very good, kind of, copilot or something like that to help us do a lot of daily tasks. This is a quite exciting domain, but also, actually, we are still just at the beginning of this journey.

LANGFORD: So relative to what Katja and Hoifung talked about, are you thinking about more general-purpose multimodal models, or are you thinking about individual special case ones for each of these individual applications?

YANG: Yeah, I think that’s a very good question. So I think all of these applications actually share some, kind of, basic rules. So in terms of the model building, actually we really need to care about the data, care about the modeling. So I will roughly talk about the modeling part. To really, OK, have a capable multimodal model, we need to encode different information from different modalities, for example, from vision, from language, from even audio speech, etc. So we need to develop a very capable encoder for each of these domains and then figure out how to tokenize each of these raw data. So the pixels are raw data, the speech is raw data, and then we need to develop some good model to tokenize each of the modalities so that we can project all of these modalities into the same space and model the interaction across different modalities’ data so that it can be really used to accomplish some complicated task for each of the domains. In my opinion, I think we share a lot of, kind of, common interest across different applications. For example, in our team, actually, we have been doing a lot of research toward the general-purpose multimodal system, and in the meantime, actually, we have great collaboration with Hoifung’s team to deliver some kind of domain-specific models, like LLaVA-Med, like BiomedJourney, etc., for the conversational medical AI system and also for medical image generation and editing or prediction. So all of these are, kind of, sharing some, kind of, basic components in terms of modeling.
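The recipe Jianwei outlines (encode each modality, project everything into one shared space, and let a single model attend across it) is roughly the pattern popularized by vision-language models such as LLaVA. Below is a minimal sketch of just the projection, or "connector," step; the encoders are stand-ins and all sizes are invented assumptions.

```python
# Minimal sketch of the "connector" pattern: a modality-specific encoder produces
# tokens, a small projector maps them into the language model's embedding space,
# and the two streams are concatenated into one sequence.
import torch
import torch.nn as nn

D_VISION, D_LM, VOCAB = 768, 1024, 32_000  # invented sizes

# Stand-ins for real components: a patch encoder, the projector ("connector"),
# and the language model's token-embedding table.
patch_encoder = nn.Linear(16 * 16 * 3, D_VISION)   # encodes flattened 16x16 RGB patches
connector = nn.Sequential(nn.Linear(D_VISION, D_LM), nn.GELU(), nn.Linear(D_LM, D_LM))
text_embed = nn.Embedding(VOCAB, D_LM)

patches = torch.randn(1, 196, 16 * 16 * 3)         # a 224x224 image as 14x14 patches
text_ids = torch.randint(0, VOCAB, (1, 32))        # a tokenized prompt

vision_tokens = connector(patch_encoder(patches))  # (1, 196, D_LM)
text_tokens = text_embed(text_ids)                 # (1, 32, D_LM)
sequence = torch.cat([vision_tokens, text_tokens], dim=1)
# `sequence` would be fed to a transformer so vision and text tokens can attend
# to each other in one shared space.
print(sequence.shape)                              # torch.Size([1, 228, 1024])
```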

LANGFORD: All right, thank you. Maybe a question to, kind of, sharpen the discussion here is, what is, sort of, the top one multimodal challenge that you guys are running into? Hoifung, maybe you can start.

POON: Sure. Happy to. First, I want to echo Katja and Jianwei. So I think one of the really exciting things is, sort of, like, a lot of the commonality, right, in a lot of this work. I think that also speaks to a more, kind of, general trend in AI that people, kind of, sometimes call, like, the great consolidation in AI. So, for example, I come from an NLP background; Jianwei comes from a computer vision background, right. We could be friends in the past, but we probably rarely shared a lot of the actual, kind of, modeling techniques in the past. But these days, a lot of the underpinnings across these different modalities, right, are actually quite common, for example, powered by the transformer—at least that’s one of the prominent paradigms, right. So that really opened up a lot of the, like, cross-pollination that Jianwei and Katja alluded to, right. Because you can use the transformer to model imaging; you can model video, model text, model protein sequences, and all that. So that’s a super, super exciting thing, and what you see across the board is, like, you see some breakthrough in one of the modalities, and often very quickly, some of that can translate to some other, right. Now, back to your earlier question, I think—and one thing, as you alluded to earlier, right, John—is that in biomedicine, there are some specific challenges. No. 1 is actually, sort of, like, a lot of this obviously is very high stakes, so we have to take the utmost, kind of, care with stuff like privacy, compliance. For example, in all of our collaborations, the data and all the compute actually stay with our partner, like, for example, a health system, strictly within their tenant, and we work within their tenant to collaborate with them to make sure that all of this is super buttoned up. So there are some immediate, sort of, like, pragmatic challenges, and then also, if you look, sort of, like, across the board, there are some infrastructure challenges. Like, for example, before the days when a lot of this data was actually in the cloud, it was very difficult, right, to do a lot of the on-prem computing. But nowadays, because a lot of data is starting to, you know, get into the cloud, that makes a lot of this kind of AI progress a lot easier to apply. And then in terms of specific, kind of, modeling challenges, we benefit a ton from the, sort of, like, general-domain progress, first and foremost, right. So a lot of our, kind of, basic modeling actually … that’s why we build a lot on Jianwei and his team’s great work, right. However, when you look at biomedicine, there are also very specific challenges. Like, I will start from an individual modality, right. So, for example, if you look at the current frontier models, let’s say, you know, GPT-4 and so forth, they are actually really, really good at reasoning and understanding biomedical text, right. But once you go beyond, sort of, like, go to the non-text modalities—once you look at CT, MRI, digital pathology, omics, and so forth—then those frontier models expose their competency gap. And the challenge is that for biomedical text, actually, there is a ton on the public web, right, that GPT-4 was able to consume. Like, PubMed alone has 32 million biomedical papers, right. But when you think about, like, multimodal longitudinal patient data, that doesn’t really, you know, exist in any quantity on the public web. So that creates lots and lots of, kind of, competency gaps. There are lots and lots of unique challenges.
For example, take digital pathology as an example, right. A pathology whole-slide image could contain billions and billions of pixels, and it could be hundreds or thousands of times larger than a typical web image. And that means, like, standard ways to use the transformer completely blow up, right, because you need quadratic compute, which means, like, billions of times the compute for a single image, right. And CT and MRI are not 2D images, and they are also very, very different from the 3D video [that] you would normally find in the general domain. So all of those, kind of, like, present very exciting challenges for individual modalities, and that creates a lot of exciting research opportunities by saying, hey, what are some of the modality-specific inductive biases, right, that we can harness to do, you know, modality-specific self-supervised pretraining, and so forth, right, so we can effectively do dimension reduction in individual modalities. But once we do that, there is still a big, big challenge, as Jianwei alluded to earlier, which is that, for example, think about, right, like, a tumor lesion in a CT image may be mapped to a very, kind of, different place in the embedding space compared to the tumor lesion in digital pathology because they are, kind of, like, independently pretrained with self-supervision. So then comes, actually, another—and I would say even much bigger—challenge, which is, like, how do we handle this kind of multimodal complexity? So, as Jianwei alluded to, how can we ensure that those tokenizations for individual modalities are actually aligned, so that now you can actually put them together and start doing effective multimodal reasoning, right? So, like, from an NLP background, you can think about this a little bit like translation. The world has hundreds if not thousands of languages, and one of the key challenges is how we can enable, sort of, like, communication across those languages, and one very effective approach, right, that the machine translation community has come up with is this idea of introducing a resource-rich language as an interlingua. So, for example, if I have a language from Africa versus some language in India, then maybe there’s pretty much, like, zero parallel data between them, right. So I don’t know how to translate between them. But if I can actually learn to translate that African language into English and then translate from English to that language in India, then we would be able to successfully bridge the two languages. And in multimodal, actually, we see an emerging opportunity to do the same thing, and we basically are using the text modality, right, as the interlingua. The reason is that for any modality under the sun, the study of that modality typically involves natural language, so when you get an image like a radiology image, you often have an accompanying radiology report. When you have a digital pathology slide, you usually have a pathology report, right. So then we use that language as, sort of, natural language supervision to ground the embedding, to nudge the embedding space of each individual modality toward the, kind of, like, common semantic space represented by the text modality. And in this way, we can also capitalize on the dramatic investment in the frontier models that are very, very good in the text modality, right. So then if we can make sure that all the modalities roughly align their common concepts into the text semantic space, then that can make the reasoning much, much more, kind of, like, easy.
And a lot of this actually … like, for example, Jianwei mentioned the LLaVA-Med work, right. That really, sort of, [was] inspired by the general-domain LLaVA work in vision-language modeling. But then we can actually generalize that into biomedicine and start using those modality-text pairs to ground the individual modalities.
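One common way to realize the "text as interlingua" grounding Hoifung describes is a CLIP-style contrastive objective over paired examples, for instance an image and its accompanying report: because each new modality is aligned against the same text space, the pairwise combinatorics he mentions later stay roughly linear in the number of modalities. The sketch below is a generic InfoNCE-style loss with random stand-in embeddings, not the LLaVA-Med training recipe.

```python
# Minimal sketch: symmetric contrastive loss that pulls each image embedding
# toward the embedding of its paired report. Encoders are omitted; in practice
# they would be pretrained modality-specific and text models.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb, report_emb, temperature=0.07):
    """image_emb, report_emb: (batch, d) embeddings of paired images and reports."""
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(report_emb, dim=-1)
    logits = img @ txt.t() / temperature       # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0))        # matching pairs lie on the diagonal
    # Symmetric cross-entropy: image-to-report and report-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-in embeddings for a batch of 8 image-report pairs.
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```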

LANGFORD: So a couple of comments. One of them is, what you described at the beginning, I’ve heard described as the “great convergence.” So it used to be that there were vision conferences and NLP conferences and machine learning conferences—and there still are. But at the same time, suddenly, what’s going on in these other conferences is very relevant to everybody. And so it’s interesting. … Suddenly, lots of more things are relevant than they were previously. And then amongst the different challenges you mentioned, I’m going to pick, I think, on the competency gap. Because it seems like it’s an interesting one that, kind of, applies potentially across many different folks. I think that the high-stakes nature of medical situations is also a very important one but specific to the medical domain. So the competency gap I’ve seen described elsewhere, and there’s been a number of studies and papers. That’s an interesting challenge, definitely. Katja?

HOFMANN: Can you remind me what you mean by competency gap? I missed that maybe …

LANGFORD: The competency of these models in the vision modality is not as good as it is typically in a text modality. In a text modality, they’re pretty good in a lot of situations. But in the vision modality, there are simple things like, you know, what’s the relationship between this object and that object in the picture where these models can just not really succeed.

POON: Yeah, so specifically, Katja, right, like, for example, first, as John alluded to, right, for things beyond text, let’s say image and speech, right, even in the general domain, there may be, sort of, like, already some challenges, areas that certainly have room for growth. But once you go to a vertical space like biomedicine, then those competency gaps are actually much, much more pronounced, right. So if you ask some of the state-of-the-art image generators to, say, hey, draw me a lung CT scan, they will actually draw you a glowing lung. [LAUGHS] They have no idea what a CT scan is. And across the board, you can see this. Like, there may be some particular classification task or something where some of that data is actually public, so there may be some academic datasets and so forth. In this case, the frontier models may have seen them and been exposed to them, so they may have been able to internalize some of that, right. So if you ask some of the really top frontier models, they at least have some idea, like, oh, this should be an x-ray; this should be a CT scan; this should be … . And sometimes, for some detailed information, maybe they also have some idea. But once you go really, really deep into questions like, hey, what is tumor lymphocyte infiltration, right? They have no idea, right. You can also argue, like, it’s very difficult to wait for the frontier models to, let’s say, actually parse all the quality tokens on the public web, because a lot of that kind of multimodal patient data simply doesn’t exist on the public web, for good reason, right. So that basically puts a very, kind of, important responsibility on us to figure out what could be a scalable AI methodology for us to quickly, efficiently bridge those competency gaps for individual modalities and also for enabling the combination, the synthesis, of them, as Jianwei alluded to earlier.

LANGFORD: So, Katja, what’s your top one challenge?

HOFMANN: There are many, many challenges. I found myself agreeing with Hoifung on many of them, and I was going through a couple of reactions. So let me start with thoughts on the competency gap and also that interlingua, whether natural language is going to be that, I don’t know, shared modality connecting across all of them. I was just, as you were speaking, thinking through that it’s weird in the sense that some of our key insights in this space come from large language models. So really a model that is started … because that data was most readily available maybe, we have a lot of our insights from specifically language, which, of course, in our own human experience, doesn’t come first. We experience the world through vision, touch, and all our other senses before we start to make sense of any of the language that is spoken around us. So it’s really, really interesting to think through the implications of that, and potentially, as we start to understand more about the different modalities that we can model and the different ways in which we combine them, some of those initial insights may no longer hold. In our examples of gameplay data recorded from visuals and controllers, we get this really highly consistent model that is able to generate 3D space, how characters interact with the space, how game mechanics work, cause and effect, and things like that, which is something that is quite different from what has been observed in many other models. So I think there is quite a lot of hope, kind of, that we will be able to arrive at models that have a much more holistic understanding of how the world works once they no longer rely on language as their primary sense, for example. And because of that, I’m also not sure whether indeed natural language will be that shared underlying thing that unites modalities. There might be other, better choices for that because language … it is abstract. It can capture so many things, but it might not be the most efficient way of communicating a lot of the things that we can grasp in other modalities. I’m really, really curious where that is heading. But coming back to your question, John, on what I see as the biggest challenge. It feels like we’re at such an early stage in this field. I could map out tens, hundreds of questions that are really, really interesting, starting from representations and data to model architectures to the way in which different model components might be trained separately and might interact in, kind of, societies of models or agents, if you will. But what I’d like to bring this back to is that for me, personally, the way we ultimately use those models, the way we understand how they can fit within our, kind of, human practices, to really make sure that when we build these models and they have the potential to change the way we do our work or play games or interact with each other, that we should be focusing on making sure that they are as useful, as empowering to their users as possible. So I think we could do a lot more in understanding not just purely what we can build but also what is the need. What are the kinds of capabilities we need to imbue our models with so that we could really drive progress in the right direction and make sure they empower as many people as possible?

LANGFORD: So I’m getting two things from what you’re saying. One of them is there’s a little bit of disagreement about what the ideal, long-term Interlingua is between the different modalities. One of them could be language; one of them could be more of a vector space representation–type thing.

HOFMANN: Yup.

LANGFORD: And then the challenge that you’re pointing out is understanding how to apply and use these models in ways which are truly useful to people.

HOFMANN: And not just apply and use the models we already have, but how do we inform that next wave of model capabilities, right? We’ll train models on whatever data we have. But then there’s also the more, maybe, demand-driven side: in order to unlock this next generation of applications, what do we need, and how do we systematically go about putting the right pieces together? For example, collecting the data that is needed to drive that kind of innovation.

LANGFORD: Jianwei, your turn. What is the primary challenge you see with multimodal models?

YANG: Yeah, I think there are some great, kind of, debates in the whole domain regarding, OK, whether we should take vision as the core or take language as the core. I want to share my two cents about the discussion Katja and Hoifung just had. I still remember at the very beginning of the deep learning era, actually, when people [were] talking about deep learning, usually they mentioned probably ResNet or AlexNet or ImageNet, this kind of vision-centric benchmark or model. But right now, actually, when people talk about deep learning, talk about AI, usually we mention, OK, the large language model, etc. You can see, OK, there’s some kind of transition from vision to language. I think this also implies some kind of challenge or some kind of, you know, methodology transition. So at the very beginning, actually, for all of these different modalities, researchers were trying to, OK, collect some labeled data and train some supervised model to classify the image, classify the text, etc. But later on, actually, people—especially in the language domain—actually came up with a new idea of self-supervised learning, so that, OK, the model can really learn from a huge amount of unlabeled data, and this huge amount of unlabeled data actually already exists out there on the internet, and people can create a lot of text data, and by nature, another benefit of this language data is that, OK, it can be naturally or natively tokenized. It’s not like the image or speech, where, actually, we cannot handle the hundreds of millions of pixels in a single image or in a single video. But language, actually, is pretty compact, as Katja just mentioned, pretty compact and also representative, so that, OK, people can easily tokenize it, convert the language into some, kind of, discrete IDs, right, and then encode these discrete IDs in the feature space. So I can feel that, OK, right now vision is lagging behind language in general. Even though I’m from the vision side, actually, I can see, OK, a gap is there in terms of how we can really build a self-supervised learning system to learn a visual representation which can match the power of the language representations so that we can really merge or bridge the gap as, John, you just mentioned, right. So different modalities definitely have gaps. But the vision side actually lags behind, and we need to handle that. And talking about this, I think one of the big challenges to building a very capable multimodal model is [dependent on] how we can bridge the gap by, OK, bringing the vision representation, bringing the vision tokenizer up to a similar level as the language representation and the language tokenizer so that we can really have a very good, kind of, intimate interaction across the two different modalities. One last point I want to make is that, OK, whether the whole system should be language native or vision native or native in any other modality, I think this also depends on the kind of application. For some applications, for example, in Hoifung’s domain, Health Futures, actually, people need to handle a lot of documents, but on the other hand, actually, in, Katja, your gaming domain, actually, the model needs to handle a lot of pixels and handle a lot of, kind of, reasoning or temporal, kind of, planning, etc. So in my opinion, I think this really depends on what kind of scenario we are really handling.
Some of the tasks actually need a lot of, kind of, good representation of the language because it’s language heavy. But some other tasks, like autonomous driving or robotics or planning or visual planning, etc., actually rely more on whether the model really understands the visual signals, from the pixels to the middle-level, kind of, segmentation to the high-level interactions across different objects in the environment or in the physical world. So I can feel there are still some, kind of, differences, and it’s still not merged, and this is, I think, why I feel it is very exciting to further push forward in this domain, yeah.
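Jianwei's point about vision tokenizers lagging behind language tokenizers can be made concrete with one common approach: quantize image patches against a learned codebook so they become discrete IDs, analogous to word-piece IDs. The sketch below is generic and illustrative; the codebook here is random rather than learned, and the patch geometry and sizes are arbitrary assumptions.

```python
# Minimal sketch of a "vision tokenizer": map image patches to discrete IDs by
# nearest-neighbor lookup in a codebook (VQ-style). A real tokenizer would train
# the codebook jointly with an encoder/decoder.
import torch

CODEBOOK_SIZE, PATCH_DIM = 8192, 16 * 16 * 3       # arbitrary sizes
codebook = torch.randn(CODEBOOK_SIZE, PATCH_DIM)   # would be learned in practice

def tokenize_image(patches):
    """patches: (n_patches, PATCH_DIM) -> (n_patches,) discrete token IDs."""
    dists = torch.cdist(patches, codebook)   # distance from each patch to every code
    return dists.argmin(dim=-1)              # index of the nearest codebook entry

patches = torch.randn(196, PATCH_DIM)        # one 224x224 image as 14x14 patches
image_token_ids = tokenize_image(patches)    # 196 discrete IDs, analogous to
print(image_token_ids[:8])                   # word-piece IDs in a text tokenizer
```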

LANGFORD: So I think what I’m getting from you is the capability gap is, kind of, a key concern, as well, is that right?

YANG: Yup, yup.

LANGFORD: OK, so capability gap, capability gap, and understanding how to really use these models in ways which are truly beneficial to people. All right, so maybe given this and given the other discussions, what are your predictions for closing these gaps and figuring out how to truly use these models in effective ways? Do you have thoughts for where we’re going to be a year from now, three years? Even three years is, kind of, far at this point. Five years? I think that trying to predict the future is one of these things where you always feel a little squirmy because it’s hard. But at the same time, I think it could be really valuable to get your sense of where things are going to go and how fast. Hoifung?

POON: Yeah, so first, I want to, sort of, kind of, echo what Katja and Jianwei pointed out, right. I actually don’t see it as a disagreement because I think, actually, all of us are talking about the same thing, which is that for all those modalities, we want to embed them into a common semantic space, right. So all of them are vector representations, right. And, for example, even when I mention this idea of text as an interlingua, it doesn’t mean that we are actually trying to map those images and sequences literally to text tokens, right. That could be one way you can do it. But, actually, the much more general way is using the text signals. And in fact, the text is also mapped to the embedding space before all this, kind of, modeling happens, like, for example, in LLaVA-Med, right. So the goal, I would say the end goal, is the same. The challenge is, sort of, like, just looking at … like, for example, a very common paradigm is contrastive learning, and a great example is CLIP, right, and that handles two modalities by saying, let’s push those two modalities into a common semantic space. That works fine when you just have two modalities. Once you have three, [LAUGHS] you start to get a three-body problem, and once you have four and five and six, that’s where actually the combinatorial explosion happens, and that’s actually what, sort of, like, motivates us to say, hey, can we reduce the combinatorial explosion to maybe a linear, right, like, alignment to some, potentially, like, pretty well-developed, kind of, semantic space, right. And text, I think, in one respect, as Jianwei alluded to, has some advantage in the sense that if you look at the public web, right, texts are not just a bunch of words; they actually capture a lot of human knowledge inside, right. So, for example, if we have a gene like EGFR, it may not mean anything, right, to folks who don’t study genomics, but for folks who study genomics, EGFR actually is a very important gene, and everybody immediately conjures up, oh, this is connected to lung cancer because a lot of lung cancers are caused by mutations in the EGFR gene, right. You also can easily conjure up the sequence for EGFR. You can conjure up knowledge about how the protein encoded by EGFR actually binds to other, kind of, proteins and what kind of pathways or functions they control. So all this, kind of, knowledge is actually captured not just by the single modality of the sequence or anything but actually by mapping to the text semantic space. There is a one-to-one mapping, so you can immediately have access to all that, kind of, vast amount of knowledge, right. And also, I want to quickly, sort of, acknowledge what, Jianwei, you mentioned: it’s fascinating, like, to think about the self-supervision landscape in computer vision versus NLP, right. In the early days, as an NLP person, I was very jealous of vision because you have so many invariances, right, translation invariance, rotation invariance, so you can use all those kinds of things to, you know, create synthetic training data. And so until the day in NLP when people figured out, like, masked language [modeling], when you start to play hide and seek with the words, then that became also a very powerful form of self-supervision. If we directly apply that heuristic to computer vision, to try to mask a patch or pixel, that’s not as effective, right, as in language so far. So exactly to your point.
It’s fascinating to think about different kinds of modalities, different kinds of inductive biases. But I would say fundamentally they all boil down to, I mean, John, you know this all too well, right, what, kind of, sort of, like, training data, in the general sense, right, are available. So the single-modality data, unannotated images, and so forth, are the most abundant, right, so there is all this, kind of, exciting work on how to harness them. I would also argue that text is not a perfect interlingua, but you can think about it as a second, sort of, source of free lunch because all these modalities often have some accompanying text associated with them. But I would be the first to point out that that’s far from complete, right. For example, think about, like, five years ago, we didn’t have the word COVID, right, even though that virus molecule could exist in some imaginary space already, right. Even today—as, Katja, you alluded to—if you look at, for example, a radiology report, it doesn’t capture every single thing captured in the images, right. In fact, even the best radiologists may not even agree with themselves … written six months ago about the same interpretation, the same report, right. So I think some of the fascinating potential would be, can we actually also ground it with, obviously, the long-term outcome, right? For example, for the patient six months later, did the cancer recur or not? That’s much less ambiguous compared to the signal that is immediately available for the modality. So that’s, kind of, like, some of my thoughts. But to directly answer your question, John, I think roughly, sort of, like, a lot—like both Katja and Jianwei mentioned—the really exciting prospect is that ultimately this is not even just an intellectual exercise. It can actually really bring huge benefit, right, potentially real-world impact, right. And when I think about what kind of real-world impact, I can see a continuum between what I will loosely call productivity gain and creativity gain. And, John, to answer your question, I already see a lot of, kind of, like, high-value, low-risk opportunities, and those are the places where typically, like, a human expert can perfectly do the task. It just could be very repetitive, could be very boring. [LAUGHS] But in this case, like, I think a lot of the multimodal GenAI already gets to a point where it can already assist the human to actually do some of those tasks at scale, right. And also the beauty of that is, exactly because a human expert can perfectly do the task, you can very naturally do a human-in-the-loop setting. So then you can ensure accuracy and ensure that actually the AI … the errors can be easily corrected, and then the errors can feed back to improve the AI system, right. So some of the examples … for example, like, think about, again, go back to a cancer patient. Unfortunately, oftentimes, for late-stage cancer patients at least, a majority of the standard care doesn’t work, right. So that leaves, like, for example, clinical trials to be the last hope. But every year, for example, in the US alone, there are 2 million new cancer patients, and then every single time, there are thousands of, you know, trials, right. If you go to ClinicalTrials.gov, half a million trials. And so how do we actually match a trial to a patient? Today, they are basically hiring these so-called trial coordinators, right.
Basically [they] manually try to look at the patient record, look at a trial, and that’s completely not scalable. But you can imagine, basically, like, we can learn this patient embedding that captures all the important information about the patient, and you can also embed the trial, right, because trials actually are specified by these eligibility criteria, like what kind of properties I want to see in the patient to include them in the trial. Then once you actually have that in the embedding space, then the matching, you can do it 24/7, and you can do what people call just-in-time clinical trial matching. Actually, a few years ago, this was still a novel concept. Nowadays, actually, this has already become, you know, applicable in many places. I just want to highlight one example. For example, we are very fortunate to collaborate with Providence, which is the fourth-largest health system in the US. They started using our AI research system actually daily now to help their tumor board to, actually, do this trial matching at scale. So examples like that are becoming more and more available, right. Like, for example, with digital pathology, you can actually use generative AI to learn a very good model to, say, classify the subtypes of the cancer and so forth, and then have human experts, you know, do [xxxx] checks and so forth. And so that’s already happening, right. And that’s actually even now, right, lots of this is already happening. But looking forward, I think the most exciting prospect is … loosely, I would call it creativity gain. That’s actually in the regime where even the best human expert has no idea how to do it. So, for example, like, with a digital pathology slide, you look at it, you can discern, here’s a tumor cell; here is the lymphocyte, or the immune cell; here are the normal cells. And looking at the configuration, right, actually gives you lots of clues about whether the immune system has already been alerted by the cancer, and thereby, it can determine to a large degree whether the immunotherapy can work, right. But right now, pathologists, even the best pathologists, only have a very weak heuristic of, OK, let me count how many lymphocytes are within the tumor region. Using that as a very rough proxy, they can do somewhat better than, you know, not looking at it at all. But arguably, there could be tons and tons of subtle patterns that even the best human experts have no idea about today, but generative AI could potentially pick up on them. I think that would be the most exciting prospect.
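The just-in-time matching Hoifung describes boils down to embedding both sides, the patient record and each trial's eligibility criteria, into the same space and ranking by similarity, with a human coordinator reviewing only the short list. A toy sketch follows, with random vectors standing in for real patient and trial embeddings and an arbitrary index size.

```python
# Minimal sketch: rank candidate trials for one patient by cosine similarity
# between the patient embedding and trial-eligibility embeddings.
import numpy as np

rng = np.random.default_rng(1)
# Toy index of trial embeddings (roughly half a million registered trials in practice).
trial_embs = rng.standard_normal((50_000, 256)).astype(np.float32)
trial_embs /= np.linalg.norm(trial_embs, axis=1, keepdims=True)

def match_trials(patient_emb, k=10):
    """Return the indices of the top-k trials for this patient."""
    p = patient_emb / np.linalg.norm(patient_emb)
    scores = trial_embs @ p
    top = np.argpartition(-scores, k)[:k]
    return top[np.argsort(-scores[top])]

candidates = match_trials(rng.standard_normal(256).astype(np.float32))
# A human trial coordinator then reviews only these few candidates (the
# human-in-the-loop step) instead of screening the whole registry by hand.
```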

LANGFORD: OK, so human-in-the-loop generative AI visual pathology, for cancers and immunotherapy. How long is that? One, three, five years?

POON: I would say that there are some immediate applications, let’s say, to pick out, like, classify one of the subtypes of the cancer. Right now, like, the best generative AI models are already doing really well. So, for example, we recently have a Nature paper …

LANGFORD: How long until it’s deployed?

POON: So right now, for example, at Providence, they are already looking into actually using that as part of the workflow to say, hey, can we use that to help the subtype …

LANGFORD: So within a year.

POON: Yeah, and also, there are some additional applications, for example, like, predicting mutations, right, and that could be solved by uncovering some of the underlying, kind of, genetic activities and so forth. But I would say the holy grail here would be to actually go from that 30 percent response rate for immunotherapy, right, to, you know, much, much higher, right. So for that one, I think there have already been some studies, but usually at a smaller scale, right, of patients, and they also haven’t actually really incorporated a lot of the really state-of-the-art, kind of, generative AI techniques, right. Right now, I think the really exciting prospect is, like, for example, one Science paper two years ago, right, had actually 200 patients, and if you don’t have a very good representation of the digital pathology slide, or including the general patient embedding, then 200 data points don’t give you that much signal, right. But on the other hand, if you can learn from billions and billions of, you know, pathology images from millions of slides, right, then you can learn a very good representation. Then, building on top of that kind of, like, representation, you could actually start to learn much, much more efficiently from the long-term outcome, right. And I would say for the first stage, like, the productivity gain … like, I also want to highlight, for example, there are exciting partnerships, like, some of our sibling teams with Paige AI, right. So they have actually also demonstrated, for example, that you can do clinical-grade, kind of, like, classification for certain tasks today, right. But the holy grail would be to, you know, go to this, kind of, like, modeling of tumor microenvironments and actually predicting immunotherapy outcomes. I can’t really predict when we will get to that above-90-percent response rate prediction, but I think there is a huge growth area. Conceptually, we already see some tiny bit of promising results, even by using very, sort of, low-dimensional features and smaller numbers of data points. So I’m pretty optimistic that if we can scale the data but also, crucially, scale and improve the representation learning, then we can get much, much better results.
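Hoifung's point that a 200-patient outcome study only becomes informative on top of a strong pretrained representation can be pictured as training just a small head on frozen embeddings. The sketch below uses synthetic data throughout; the cohort size, dimensions, and logistic-regression-style head are illustrative assumptions.

```python
# Minimal sketch: with embeddings pretrained on millions of slides, a small
# labeled outcome cohort only needs to train a lightweight head on top.
import torch
import torch.nn as nn

torch.manual_seed(0)
# Frozen, pretrained patient/slide embeddings for a small cohort (all synthetic).
frozen_embeddings = torch.randn(200, 512)
outcomes = torch.randint(0, 2, (200,)).float()  # responder (1) vs. non-responder (0)

head = nn.Linear(512, 1)                        # the only trainable part
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(head(frozen_embeddings).squeeze(-1), outcomes)
    loss.backward()
    optimizer.step()
# With random features and labels nothing meaningful is learned; the point is
# that the heavy lifting lives in the pretrained representation, not the head.
```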

LANGFORD: That’s great. Katja, your predictions for the future?

HOFMANN: Let’s see. One, three, five years? I’m very optimistic that in the shorter term—maybe I have a biased view—data from games can be very, very influential in helping us answer some of those fundamental questions on how to train and use multimodal models. The, kind of, degree to which we can collect rich data, multimodal data, in a reasonably controllable way at reasonable scale just makes these kinds of environments a prime candidate for driving insights that now, in this new world, are much more generally applicable than they might have been in the past. So I would say we’ll definitely see the benefits of that in the coming year. Within the three-year horizon, one thing that’s really interesting and also, kind of, connects to my journey as a researcher is that over the past 10 or so years, I’ve invested a lot of my effort—and my team has invested a lot—in things like reinforcement learning and decision-making. And so now with generative models, we’ve primarily seen the power of predictive models trained at massive scale. Yes, there are some RLHF [reinforcement learning from human feedback] around the edges, but I think we will see a return to some of those fundamental questions around reinforcement and decision-making settings, but building on top of the insights that we’ve gained from training these large-scale multimodal generative models. And I would say that in around three years, that’ll be back in full force, and we’ll have a lot of fun benefiting from maybe some of those, kind of, further back insights that maybe, John, you and our teams have developed over time. So I’m very much looking forward to that. Five-year horizon? You’re right. It’s hard to predict that, especially as things seem to be moving so quickly. But one thing that I would expect in that time frame is that some of the innovations that we are working on today will make their way into the hands of actual end users, and so I believe that starting to understand what that new human-AI collaboration paradigm could look like, that is something that everyone with computer access would be able to experience within the next five years.

LANGFORD: OK, excellent. Jianwei?

YANG: Yeah, honestly, I always think it’s very hard to predict the future on such a five-year horizon, so I want to share an old Chinese saying. It says, OK, “read 10,000 books and walk 10,000 miles.” I’d say in the past maybe five years, since GPT or since the transformer era came, basically, OK, we have been striving to make the model read thousands of books or hundreds of thousands of books, right, from the internet, from Wikipedia, etc. So the model itself right now, like GPT-4 or many other, kind of, open-source models, actually has already got a lot of knowledge from the textbooks. They have a very good understanding about how the world actually operates, right. But the problem is that, OK, this knowledge is not grounded at all. This knowledge is great, but this knowledge actually is not grounded in any kind of observation, any kind of digital world or physical world. In my opinion, I think in the next few years … even actually, it is already happening now. People are trying to ground this big brain, learned by reading a lot of books, right, onto the digital world or physical world represented by, OK, images, by video, or by many other modalities, as Hoifung just mentioned. So I can imagine, OK, in the next one or two years, actually, people are really trying to squeeze the knowledge out from the big, or giant, large language model to really, OK, build the connection between this, kind of, heavy and rich knowledge and the visual observations or other types of observations. So this is one thing I can imagine, which would very likely be happening very soon. People are trying to build the connection, and people are trying to really, OK, make the other part of the model stronger and have some kind of connector in between. After that, I can imagine that, OK, this kind of progress will probably start happening in the digital world because, OK, as we just mentioned, in the digital world, actually, people can obtain the data very quickly. People can create a lot of data from the internet. Gradually, actually, if we use up all of the data from the internet, we really need to put this system into the physical world, into the environment that we are living in, right. We really want to put this model out, put this system out, and let it actually explore in the physical world, interact with the environment, and learn more knowledge. I want to echo, Katja, what you just said, actually. The gaming environment, I would say, is a really great, kind of, simulator or emulator of the real environment. There are a lot of interactions between the agent and the environment, right. So I think this kind of information can really be beneficial to help the model learn more grounded knowledge about the world. Later on, actually, we can probably really deliver this model, this system … recently, our team actually has been doing some, kind of, research on, OK, how we can convert a multimodal language model like LLaVA or Phi-3-V to a vision-language-action model. It has already learned good knowledge about the vision and language domain, but how can we make it more capable of making decisions, making plans, to accomplish some daily tasks, like what we are doing in our daily life? This, kind of, I can imagine, could happen very soon, in maybe a few years. I feel it’s a very exciting area for us to really push forward in this direction. Also, I’m very, kind of, optimistic that we will see very great things in the next few years.
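Jianwei does not detail how a multimodal language model becomes a vision-language-action model, so the following is only a generic sketch of the idea: keep a multimodal backbone that already fuses observation and instruction tokens, and add a small head that decodes an action. None of this is LLaVA or Phi-3-V code; every component, size, and the discrete action space are invented for illustration.

```python
# Minimal, generic sketch of a vision-language-action wrapper: a stand-in
# multimodal backbone plus a small policy head that outputs action logits.
import torch
import torch.nn as nn

D_MODEL, N_ACTIONS = 512, 7  # e.g., 7 discretized action bins; purely arbitrary

class ToyVLA(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in VLM
        self.action_head = nn.Linear(D_MODEL, N_ACTIONS)            # new policy head

    def forward(self, multimodal_tokens):     # (batch, seq, D_MODEL)
        h = self.backbone(multimodal_tokens)
        return self.action_head(h[:, -1])     # action logits from the final token

# One observation-plus-instruction sequence, already embedded as tokens.
tokens = torch.randn(1, 64, D_MODEL)
action_logits = ToyVLA()(tokens)              # (1, 7): choose an action to execute
```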

LANGFORD: So do you have timelines for these? What’s your expectation for how long it’ll take?

YANG: Yeah, so talking about the ultimate goal in my mind, like I just mentioned, it is how we can build a real AI agent that can traverse the digital world and also traverse the physical world. We already see a lot of work in the digital world, like the systems we have built. On the other hand, in the physical world, actually, we have yet to see a real, kind of, robot that can undertake daily tasks like we do. I can imagine this will require a lot of effort from different aspects. But I’m a little bit more optimistic that, OK, it could happen in maybe five years to 10 years, that, OK, we really can buy some kind of real robot, right, and put it in a home and have it help us do some household tasks, something like that. I’m probably, yeah, a little bit optimistic, yeah.

LANGFORD: So five to 10 years until we have a generative AI robot that can do useful things …

YANG: Yeah, yeah …

LANGFORD: … in your home. Excellent. All right, I think we’ve probably gone longer than they wanted us to. Thank you, everyone. It’s great to hear from you. I’ve actually learned quite a bit during this discussion.

POON: Thanks so much.

HOFMANN: I really enjoyed the discussion. Thanks so much everyone.

YANG: Yeah, thank you for the invitation.