ADeLe: Predicting and explaining AI performance across tasks
http://approjects.co.za/?big=en-us/research/blog/adele-predicting-and-explaining-ai-performance-across-tasks/
Wed, 01 Apr 2026


At a glance

  • AI benchmarks report performance on specific tasks but provide limited insight into underlying capabilities; ADeLe evaluates models by scoring both tasks and models across 18 core abilities, enabling direct comparison between task demands and model capabilities.
  • Using these ability scores, the method predicts performance on new tasks with ~88% accuracy, including for models such as GPT-4o and Llama-3.1.
  • It builds ability profiles and identifies where models are likely to succeed or fail, highlighting strengths and limitations across tasks.
  • By linking outcomes to task demands, ADeLe explains differences in performance, showing how it changes as task complexity increases.

AI benchmarks report how large language models (LLMs) perform on specific tasks but provide little insight into the underlying capabilities that drive that performance. They do not explain failures or reliably predict outcomes on new tasks. To address this, Microsoft researchers, in collaboration with Princeton University and Universitat Politècnica de València, introduce ADeLe (opens in new tab) (AI Evaluation with Demand Levels), a method that characterizes both models and tasks using a broad set of capabilities, such as reasoning and domain knowledge, so performance on new tasks can be predicted and linked to specific strengths and weaknesses in a model.

In a paper published in Nature, “General Scales Unlock AI Evaluation with Explanatory and Predictive Power (opens in new tab),” the team describes how ADeLe moves beyond aggregate benchmark scores. Rather than treating evaluation as a collection of isolated tests, it represents both benchmarks and LLMs using the same set of capability scores. These scores can then be used to estimate how a model will perform on tasks it has not encountered before. The research was supported by Microsoft’s Accelerating Foundation Models Research (AFMR) grant program.

ADeLe-based evaluation

ADeLe scores tasks across 18 core abilities, such as attention, reasoning, and domain knowledge, assigning each task a value from 0 to 5 for each ability based on how much the task requires it. For example, a basic arithmetic problem might score low on quantitative reasoning, but an Olympiad-level proof would score much higher.

Evaluating a model across many such tasks produces an ability profile—a structured view of where the model performs well and where it breaks down. Comparing this profile to the demands of a new task makes it possible to identify the specific gaps that lead to failure. The process is illustrated in Figure 1.

Figure 1. Top: (1) Model performance on the ADeLe benchmark and (2) the resulting ability profiles, showing each model’s strengths and limitations across core abilities. Bottom: (1) Application of 18 scoring criteria to each task and (2) the resulting task profiles, showing the abilities each task requires.
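To make the comparison between task demands and model abilities concrete, the sketch below shows one simple way such profiles could be compared to flag likely failure dimensions. The ability names, values, and margin are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch: comparing a task's demand profile (0-5 per ability)
# against a model's ability profile to flag likely failure dimensions.
# Ability names and values are hypothetical, not taken from the paper.

task_demands = {"quantitative_reasoning": 4.0, "domain_knowledge": 2.0, "attention": 1.5}
model_abilities = {"quantitative_reasoning": 3.2, "domain_knowledge": 4.5, "attention": 3.8}

def likely_failure_dimensions(demands, abilities, margin=0.0):
    """Return abilities where the task demands more than the model can supply."""
    return {
        ability: round(level - abilities.get(ability, 0.0), 2)
        for ability, level in demands.items()
        if level - abilities.get(ability, 0.0) > margin
    }

print(likely_failure_dimensions(task_demands, model_abilities))
# {'quantitative_reasoning': 0.8}  -> the likely source of failure on this task
```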

Evaluating ADeLe

Using ADeLe, the team evaluated a range of AI benchmarks and model behaviors to understand what current evaluations capture and what they miss. The results show that many widely used benchmarks provide an incomplete and sometimes misleading picture of model capabilities and that a more structured approach can clarify those gaps and help predict how models will behave in new settings.

ADeLe shows that many benchmarks do not isolate the abilities they are intended to measure or only cover a limited range of difficulty levels. For example, a test designed to evaluate logical reasoning may also depend heavily on specialized knowledge or metacognition. Others focus on a narrow range of difficulty, omitting both simpler and more complex cases. By scoring tasks based on the abilities they require, ADeLe makes these mismatches visible and provides a way to diagnose existing benchmarks and design better ones.

Applying this framework to 15 LLMs, the team constructed ability profiles using 0–5 scores for each of 18 abilities. For each ability, the team measured how performance changes with task difficulty and used the difficulty level at which the model has a 50% chance of success as its ability score. Figure 2 illustrates these results as radial plots that show where the model performs well and where it breaks down.

Figure 2. Ability profiles for 15 LLMs across 18 abilities. Left: OpenAI models. Middle: Llama models. Right: DeepSeek-R1 distilled models.
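As a rough illustration of how an ability score like those in Figure 2 can be read off, the snippet below fits a logistic curve to synthetic per-task outcomes and reports the demand level at which success probability crosses 50%. It mirrors the idea described above but is not the paper's exact procedure; the data are made up.

```python
# Minimal sketch with synthetic data: estimate an ability score as the demand
# level (0-5) at which a model's success probability crosses 50%.
import numpy as np
from sklearn.linear_model import LogisticRegression

levels = np.array([[0], [0], [1], [1], [2], [2], [3], [3], [4], [4], [5], [5]], dtype=float)
solved = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0])  # 1 = model solved the task

clf = LogisticRegression().fit(levels, solved)
# Fitted curve: p(success) = sigmoid(w * level + b); it crosses 0.5 where level = -b / w.
ability_score = -clf.intercept_[0] / clf.coef_[0, 0]
print(f"ability score ≈ {ability_score:.2f}")
```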

This analysis shows that models differ in their strengths and weaknesses across abilities. Newer models generally outperform older ones, but not consistently across all abilities. Performance on knowledge-heavy tasks depends strongly on model size and training, while reasoning-oriented models show clear gains on tasks requiring logic, learning, abstraction, and social inference. These patterns typically require multiple, separate analyses across different benchmarks and can still produce conflicting conclusions when task demands are not carefully controlled. ADeLe surfaces them within a single framework.

ADeLe also enables prediction. By comparing a model’s ability profile to the demands of a task, it can forecast whether the model will succeed, even on tasks that are unfamiliar. In experiments, this approach achieved approximately 88% accuracy for models like GPT-4o and LLaMA-3.1-405B, outperforming traditional methods. This makes it possible to both explain and anticipate potential failures before deployment, improving the reliability and predictability of AI model assessment.

Whether AI systems can truly reason is a central debate in the field. Some studies report strong reasoning performance, while others show that models break down as tasks become more demanding. These results largely reflect differences in task difficulty. ADeLe shows that benchmarks labeled as measuring “reasoning” vary in what they require, from basic problem-solving to tasks that combine the need for advanced logic, abstraction, and domain knowledge. The same model can score above 90% on lower-demand tests and below 15% on more demanding ones, reflecting differences in task requirements rather than a change in capability.

Reasoning-oriented models like OpenAI’s o1 and GPT-5 show measurable gains over standard models—not only in logic and mathematics but also in interpreting user intent. However, performance declines as task demands increase. AI systems can reason, but only up to a point, and ADeLe identifies where that point is for each model.


Looking ahead

ADeLe is designed to evolve alongside advances in AI and can be extended to multimodal and embodied AI systems. It also has the potential to serve as a standardized framework for AI research, policymaking, and security auditing.

More broadly, it advances a more systematic approach to AI evaluation—one that explains system behavior and predicts performance. This work builds on earlier efforts, including Microsoft research on applying psychometrics to AI evaluation and recent work on Societal AI, both of which emphasize the importance of rigorous AI evaluation.

As general-purpose AI systems continue to outpace existing evaluation methods, approaches like ADeLe offer a path toward more rigorous and transparent assessment in real-world use. The research team is working to expand this effort through a broader community. Additional experiments, benchmark annotations, and resources are available on GitHub (opens in new tab).

AsgardBench: A benchmark for visually grounded interactive planning
http://approjects.co.za/?big=en-us/research/blog/asgardbench-a-benchmark-for-visually-grounded-interactive-planning/
Thu, 26 Mar 2026


At a glance

  • To successfully complete tasks, embodied AI agents must ground and update their plans based on visual feedback.
  • AsgardBench isolates whether agents can use visual observations to revise their plans as tasks unfold.
  • Spanning 108 controlled task instances across 12 task types, the benchmark requires agents to adapt their plans based on what they observe.
  • Because objects can be in different positions and states (e.g., clean or dirty), the same instruction can require different action sequences, even in the same environment.

Imagine a robot tasked with cleaning a kitchen. It needs to observe its environment, decide what to do, and adjust when things don’t go as expected, for example, when the mug it was tasked to wash is already clean, or the sink is full of other items. This is the domain of embodied AI: systems that perceive their environment and act within it.

The field has made rapid progress, but evaluating these systems is harder than it looks. Many benchmarks test perception, navigation, and physical control all at once, making it difficult to isolate whether an AI agent is actually using what it perceives to make better decisions or just getting lucky because the environment is predictable enough to script around.

To address this, we created AsgardBench. In the paper, “AsgardBench — Evaluating Visually Grounded Interactive Planning Under Minimal Feedback,” we describe how this benchmark poses a simple but demanding challenge: give an AI agent a household task, let it observe the environment through images, and see whether it can adjust its plan when what it perceives contradicts what it anticipated. Can it notice that the mug it needs to clean is already in the sink, or that it isn’t, and behave accordingly? That is the core question AsgardBench is designed to answer.

Built on AI2-THOR, an interactive 3D simulation environment used to train and evaluate AI agents on household tasks, AsgardBench positions agents near objects and gives them a small, fixed set of actions, such as find, pickup, put, clean, and toggle_on/off. At each turn, the agent proposes a full sequence of steps to complete the task, but only the first step executes. Throughout, the focus is squarely on plan adaptation: not whether an agent can navigate a room or manipulate an object, but whether it can use what it perceives to revise its next step.

For example, the agent may discover a mug to be clean, dirty, or filled with coffee, or it may observe that a sink contains many other items, so the same instruction can require different action sequences as the task unfolds. This process is illustrated in Figure 1.

Figure 1: Agent observations and corresponding action plans in AsgardBench. Each image is paired with the plan generated from that observation. This illustrates how AsgardBench requires agents to update or change their plans based on new visual evidence rather than following a fixed sequence.

How it works

Agents start in interaction-ready positions, so navigation and viewpoint selection are not factors. A find action brings objects into view, and the environment handles the details of container sizing and placement, so the agent does not need to reason about which cabinet or countertop to use. The only inputs are color images, a history of attempted actions with simple success or failure signals, and the agent’s own record of what it plans to do next.

At each turn, the agent proposes a complete sequence of steps to finish the task, but only the first step proceeds. It then receives new images and a simple signal—did that action succeed or fail? This prevents the agent from scripting everything upfront and forces it to re-evaluate and revise its plan at every step. Built-in limits on total steps and repeated actions prevent endless loops. Because the environment provides only simple feedback, the agent must be able to notice what it perceives (e.g., whether a mug is dirty, whether a faucet is running) and keep track of where it is in the task from one step to the next.
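A minimal sketch of this loop is shown below. The environment and agent interfaces are hypothetical stand-ins for the pattern just described, not AsgardBench’s actual API, and the step cap is an assumed value.

```python
# Sketch of the AsgardBench-style interaction loop described above.
# `env` and `agent` are hypothetical interfaces, not the benchmark's real API.

MAX_STEPS = 30  # assumed cap on total steps to prevent endless loops

def run_episode(env, agent, instruction):
    history = []                                  # (action, succeeded) pairs
    images = env.reset(instruction)               # initial color images only
    for _ in range(MAX_STEPS):
        # The agent proposes a full plan each turn, but only the first step runs.
        plan = agent.propose_plan(instruction, images, history)
        if not plan:
            break
        action = plan[0]
        images, succeeded, done = env.step(action)
        history.append((action, succeeded))       # minimal feedback: success/failure only
        if done:
            return True
    return False
```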

Evaluating AsgardBench

We tested several leading vision-capable models on AsgardBench and observed that high-performing models require visual grounding to consistently succeed. Across the models, visual input substantially improved performance: most models more than doubled success rates when given images versus text-only descriptions of the scene. This is in contrast to some prior benchmarks where agents could perform reasonably well without vision by relying on textual feedback on what went wrong.

Providing that kind of detailed failure information raises performance for all models in AsgardBench, too, but it can mask the real problem. The strongest vision-capable models still outperform text-only agents even when those agents are given detailed feedback, demonstrating that the benchmark requires visual grounding that text alone cannot replicate. Performance on AsgardBench is illustrated in Figure 2.

Figure 2. Success rates for image-based and text-only conditions. Visual input substantially improves performance for all but the weakest agents, while text-only performance remains low, indicating that AsgardBench requires perception-based reasoning.

The results also revealed where today’s agents consistently fall short. Across all models, the same problems kept appearing: agents attempted infeasible actions (e.g., trying to clean a mug that was not in the sink), got stuck in repeated action loops, misinterpreted subtle visual cues (on/off, clean/dirty), and lost track of where they were in the task from one step to the next. This points to three weaknesses: the inability to distinguish subtle visual details in cluttered scenes, the inability to maintain an accurate picture of task progress across multiple steps, and the inability to consistently translate what the agent sees into timely updates to its plan. Taken together, these point to where the next generation of embodied agents will need to improve.


Implications and looking ahead

AsgardBench is useful as both a diagnostic and development tool. By varying what feedback agents receive (none, minimal, or detailed), researchers can isolate whether performance gains come from better perception, better memory, or better planning. Promising directions include systems that combine stronger visual understanding with better state tracking, training approaches that emphasize learning to repair plans mid-task, and evaluation methods that measure not just whether an agent succeeds but how well it adapted along the way.

The failure patterns AsgardBench surfaces point toward a concrete next step: building systems that can make finer visual distinctions, keep track of what changed more reliably across steps, and learn to revise plans mid-task rather than plowing ahead on a script. Agents that make progress on these challenges should be meaningfully better equipped for the messiness of real-world environments: unexpected object states, cluttered scenes, and the constant need to adapt.

AsgardBench is open source and available on GitHub (opens in new tab), providing a foundation for advancing research in visually grounded planning.

Acknowledgements

We thank the AI2-THOR community for building the simulation platform and making reproducible embodied evaluation possible.

GroundedPlanBench: Spatially grounded long-horizon task planning for robot manipulation
http://approjects.co.za/?big=en-us/research/blog/groundedplanbench-spatially-grounded-long-horizon-task-planning-for-robot-manipulation/
Thu, 26 Mar 2026


At a glance

  • VLM-based robot planners struggle with long, complex tasks because natural-language plans can be ambiguous, especially when specifying both actions and locations.
  • GroundedPlanBench evaluates whether models can plan actions and determine where they should occur across diverse, real-world robot scenarios.
  • Video-to-Spatially Grounded Planning (V2GP) is a framework that converts robot demonstration videos into spatially grounded training data, enabling models to learn planning and grounding jointly.
  • Grounded planning improves both task success and action accuracy, outperforming decoupled approaches in benchmark and real-world evaluations.

Vision-language models (VLMs) use images and text to plan robot actions, but they still struggle to decide what actions to take and where to take them. Most systems split these decisions into two steps: a VLM generates a plan in natural language, and a separate model translates it into executable actions. This approach often breaks down for long, complex tasks because natural-language plans can be ambiguous or even hallucinated when specifying actions and locations (Figure 1). Because planning and spatial reasoning are handled separately, errors in one stage can propagate to the next. This raises a key question: can a VLM determine both what to do and where to do it simultaneously?

Figure 1. Failure cases for a VLM-based task planner. Given the instruction “discard all paper cups to bin,” the planner produces an action sequence with ambiguous cup references and a hallucinated step, “place inside the cabinet.” Ambiguous language and grounding lead to non-executable plans.

Planning with spatial grounding

To address this problem, we developed GroundedPlanBench (opens in new tab). In our paper, “Spatially Grounded Long-Horizon Task Planning in the Wild,” we describe how this new benchmark evaluates whether VLMs can plan actions and determine where those actions should occur across diverse real-world environments. We also built Video-to-Spatially Grounded Planning (V2GP), a framework that converts robot demonstration videos into training data to help VLMs learn this capability.

Evaluating both open- and closed-source VLMs on the benchmark, we found that grounded planning for long, complex tasks remains challenging. At the same time, V2GP improves both planning and grounding, with gains validated on our benchmark and in real-world experiments using robots.

How GroundedPlanBench works

To create realistic robot scenarios, we built our benchmark from 308 robot manipulation scenes in the Distributed Robot Interaction Dataset (DROID) (opens in new tab), a large collection of recordings of robots performing tasks. We worked with experts to review each scene and define tasks that a robot could perform. Each task was written in two styles: explicit instructions that clearly describe the actions (e.g., “put a spoon on the white plate”) and implicit instructions that describe the goal more generally (e.g., “tidy up the table”).

For each task, the plan was broken down into four basic actions—grasp, place, open, and close—each tied to a specific location in the image. Grasp, open, and close actions were linked to a box drawn around the target object, while place actions were linked to a box showing where the object should be placed.
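To illustrate what such spatially grounded steps might look like as data, here is a small sketch of a possible plan representation. The field names, coordinate convention, and example values are assumptions for illustration, not the benchmark’s actual schema.

```python
# Illustrative representation of a spatially grounded plan: each action is
# tied to a bounding box in the image. Field names and coordinates are
# hypothetical, not the benchmark's actual schema.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels

@dataclass
class GroundedAction:
    verb: str    # "grasp", "place", "open", or "close"
    target: str  # natural-language description of the object or region
    box: Box     # grasp/open/close: box around the object; place: box around the placement region

plan: List[GroundedAction] = [
    GroundedAction("grasp", "spoon", (412, 230, 468, 290)),
    GroundedAction("place", "white plate", (120, 310, 260, 420)),
]
```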

Figure 2 illustrates medium- and long-duration tasks, along with their explicit and implicit instructions. In total, GroundedPlanBench contains 1,009 tasks, ranging from 1–4 actions (345 tasks) to 5–8 (381) and 9–26 (283).

Figure 2. Examples of tasks in GroundedPlanBench, comparing explicit and implicit instructions: one about placing bottles and a cup into a sink, and another about placing eggs and vegetables into a silver bowl. Implicit instructions summarize explicit object lists into higher-level descriptions.

How V2GP works

The V2GP framework first uses the recorded gripper signals to detect moments when the robot interacts with objects. It then generates a text description of the manipulated object with a multimodal language model. Guided by this description, the system tracks the object across the video using Meta’s open-vocabulary image and video segmentation model, SAM3. The system then constructs grounded plans from the tracking results, identifying the object’s location at the moment it is grasped and where it is placed.

This process is illustrated in Figure 3. It yielded 43K grounded plans with varying lengths: 34,646 plans with 1–4 actions, 4,368 with 5–8 actions, and 4,448 with 9–26 actions.

Figure 3. Overview of the V2GP framework, which converts robot demonstration videos into spatially grounded plans: videos are segmented into temporal sub-actions, matched with active objects, spatially grounded with grasp boxes and placement points, and converted into unified training samples containing language instructions and structured action plans.
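The outline below sketches the pipeline just described as a single function, with the individual components passed in as callables. It is a hypothetical rendering of the steps, not the released V2GP implementation.

```python
# Hypothetical sketch of the V2GP pipeline described above. The three
# components (event detection, object description, tracking) are passed in
# as callables; this is not the released implementation.

def video_to_grounded_plan(video, gripper_signal, instruction,
                           detect_interaction_events, describe_active_object, track_object):
    actions = []
    # 1. Detect grasp/release moments from the recorded gripper signal.
    for event in detect_interaction_events(gripper_signal):
        frame = video.frame_at(event.grasp_time)
        # 2. Describe the manipulated object with a multimodal language model.
        object_text = describe_active_object(frame)
        # 3. Track the object through the video with an open-vocabulary segmenter.
        track = track_object(video, object_text)
        # 4. Read off grounded actions: where the object is grasped and where it is placed.
        actions.append({"action": "grasp", "object": object_text,
                        "box": track.box_at(event.grasp_time)})
        actions.append({"action": "place", "object": object_text,
                        "box": track.box_at(event.release_time)})
    return {"instruction": instruction, "actions": actions}
```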

Evaluating decoupled versus grounded planning

To evaluate models on GroundedPlanBench in real-world robotic settings, we used Qwen3-VL (opens in new tab) as our base model. Qwen3-VL is a vision-language model that processes text, images, and video to support multimodal reasoning, and it performs well on standard multimodal reasoning benchmarks without additional training. We first evaluated it, along with proprietary models, on GroundedPlanBench without any task-specific training (Table 1). We then fine-tuned it on V2GP training data and compared it with a decoupled approach, in which planning and grounding are handled separately.

In this setup, a VLM first generated a plan describing what the robot should do. We used GPT-5.2 or Qwen3-VL-4B for this step. The plan was then passed to a spatial grounding model, Embodied-R1 (opens in new tab), which converted the plans into executable signals. Embodied-R1 is a large vision-language model trained for embodied reasoning and pointing, where the model identifies specific locations in the image to guide the robot’s actions. We selected it for spatial grounding because its training targets embodied spatial reasoning and point-based localization, making it well suited for grounding model outputs to specific locations in an image.

Figure 4 highlights a key limitation of this approach: ambiguity in natural language. For example, Qwen3-VL-4B generated grasp actions by referring to “napkin on the table” for all four napkins in the scene, leading Embodied-R1 to ground each action to the same napkin. GPT-5.2 produced more descriptive phrases, such as “top-left napkin” or “upper-center napkin,” but these were still too imprecise for the model to reliably distinguish between them and were again grounded to the same object.

Figure 4. Decoupled vs. grounded planning for the instruction “Put four napkins on the couch.” Several baseline methods ground actions to the wrong objects because of ambiguous language, while the grounded V2GP method correctly identifies the napkins and their placement locations.

This limitation becomes more pronounced in real-world robot manipulation, where environments are often cluttered and complex. As a result, decoupled approaches struggle to work reliably. In contrast, our approach, grounded planning, performs planning and grounding jointly within a single model and improves both planning and grounding performance.

Table 1 presents evaluation results for open- and closed-source VLMs on GroundedPlanBench. Multi-step planning and handling of implicit instructions were challenging for all models, while training Qwen3-VL-4B and Qwen3-VL-32B with V2GP led to significant improvements in grounded planning.

Table 1. Evaluation results on GroundedPlanBench for proprietary and open-source VLMs on explicit and implicit instructions of varying lengths. Task Success Rate (TSR) measures the percentage of tasks completed correctly, requiring all actions to be both correctly planned and spatially grounded. Action Recall Rate (ARR) measures the proportion of generated actions that match the sub-actions defined in the dataset, regardless of order. The V2GP approach improves performance on both metrics and achieves the best results (shown in bold).
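For reference, a minimal sketch of how these two metrics could be computed is shown below. Action matching is simplified here to exact string equality, which is an assumption; the benchmark’s actual criteria also check spatial grounding against annotated boxes.

```python
# Minimal sketch of the metrics in Table 1. Action matching is simplified to
# exact equality; the benchmark's real criteria also verify spatial grounding.

def task_success_rate(task_outcomes):
    """task_outcomes: list of booleans, one per task (all steps planned and grounded correctly)."""
    return sum(task_outcomes) / len(task_outcomes)

def action_recall_rate(predicted_actions, reference_actions):
    """Fraction of reference sub-actions recovered by the prediction, regardless of order."""
    matched = sum(1 for ref in reference_actions if ref in predicted_actions)
    return matched / len(reference_actions)

print(task_success_rate([True, False, True, True]))                      # 0.75
print(action_recall_rate(["grasp cup", "place sink", "open tap"],
                         ["grasp cup", "place sink", "open tap", "close tap"]))  # 0.75
```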


Implications and looking forward

Integrating planning and grounding within a single model offers a path to more reliable robot manipulation in real-world settings. Rather than relying on separate stages, this approach keeps decisions about what to do and where to act tightly coupled. Even so, models still struggle with longer, multi-step tasks and implicit instructions: they must reason over longer sequences of actions, maintain consistency across many steps, and interpret goals described indirectly, as in everyday language.

Looking ahead, a promising direction combines grounded planning with world models, which enable robots to predict the outcomes of actions before executing them. Together, these capabilities could allow robots to decide what to do, where to act, and what will happen next, bringing us closer to systems that can plan and act reliably in the real world.

Acknowledgements

This research was conducted in collaboration with Korea University, Microsoft Research, and the University of Wisconsin-Madison, and was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant (No. RS-2025-25439490) funded by the Korea government (MSIT).

Will machines ever be intelligent?
http://approjects.co.za/?big=en-us/research/podcast/will-machines-ever-be-intelligent/
Mon, 23 Mar 2026

Are machines truly intelligent? AI researchers Subutai Ahmad and Nicolò Fusi join Doug Burger to compare transformer-based AI with the human brain, exploring continual learning, efficiency, and whether today’s models are on a path toward human intelligence.


Technical advances are moving at such a rapid pace that it can be challenging to define the tomorrow we’re working toward. In The Shape of Things to Come, Microsoft Research leader Doug Burger and experts from across disciplines tease out the thorniest AI issues facing technologists, policymakers, business decision-makers, and other stakeholders today. The goal: to amplify the shared understanding needed to build a future in which the AI transition is a net positive. 

In this first episode of the series, Burger is joined by Nicolò Fusi of Microsoft Research and Subutai Ahmad (opens in new tab) of Numenta to examine whether today’s AI systems are truly intelligent. They compare transformer-based large language models (LLMs) with the human brain’s distributed, continuously learning architecture, exploring differences in efficiency, representation, and sensory-motor grounding. The discussion probes what intelligence really means, where current models excel or fall short, and what future AI systems might need to bridge the gap.

Transcript

[MUSIC] 

DOUG BURGER: This is The Shape of Things to Come, a Microsoft Research Podcast. I’m your host, Doug Burger. In this series, we’re going to venture to the bleeding edge of AI capabilities, dig down into the fundamentals, really try to understand them, and think about how these capabilities are going to change the world—for better and worse.   

In today’s podcast, I’m bringing on two AI researcher-experts: Nicolò Fusi, who is an expert in digital, transformer-based large language model architectures and learning, and Subutai Ahmad, who is an expert in biological architectures, specifically the human brain. And the question we’re going to discuss is, are machines intelligent?  

And what I mean by that: are digital intelligence, large language models, on a path to surpass humans, or are the architectures just so fundamentally different that one will do one set of things well, the other will do something else very well? And so we’ll be debating the architecture of intelligence across digital implementations and biological implementations because the answer to that question, I think, really will determine the shape of things to come. 

[MUSIC FADES] 

I’d like to ask each of my guests to introduce themselves. Tell me a little bit about your background and what you’re currently working on—to the extent you can talk about it—in AI. So, Nicolò, would you please start? 

NICOLÒ FUSI: Yeah, thank you, Doug, for having us and having me here. It’s so much fun. So I’m Nicolò Fusi. I’m a researcher at MSR [Microsoft Research]. So Doug is my boss, so I will be very, very, very good to Doug in this podcast.  

No, but jokes aside, my own background is in Bayesian nonparametrics. That’s what I started studying. So Gaussian processes and things like that. And then equally, I would say, in computational biology, because I found it, like, one of the most interesting use cases for AI techniques. And that, kind of, has been true throughout my career. And pretty much like everybody else, eventually, I moved away from the kernel methods and the Bayesian nonparametrics and I started working more on language models, transformer models, with a particular eye towards information theory and the connection between information theory and generative modeling. And that’s, kind of, one of the main things I do today other than, kind of, managing the research of people who do much more interesting work than I do. [LAUGHS]  

BURGER: I have to interject there, Nicolò, because you dragged a piece of bait across my path.  

FUSI: I figured.  

BURGER: You know, at Microsoft Research, I have a management rule that I can’t tell anyone what to do because we hire some of the best people in the world. You have to trust them. And everyone is always completely free to call BS on me. And so Nicolò was joking there; [LAUGHTER] he does not have to toe the party line. In fact, I encourage him not to. So, so … 

FUSI: I just have to be well-behaved. That’s the only thing I will say. [LAUGHS] 

BURGER: Yeah. Thank you, thank you for baiting me. [LAUGHS] Because he knew exactly what he was doing. And I love him for it.  

Subutai, can you tell us a little bit about yourself? 

SUBUTAI AHMAD: Sure. Thank you so much, Doug, for having me. I’m really looking forward to the conversation between us all.  

So I see myself fundamentally as a computer scientist. You know, I’ve been studying computer science for longer than I care to admit. But something changed for me during my undergrad years. I decided to minor in cognitive psychology, and I started to get really interested in how the brain works. 

And to me, understanding intelligence and implementing intelligence was the hardest problem a computer scientists could ever solve. So I got very, very interested in that. You know, I couldn’t see how to really commercialize that. I was very interested in making products and stuff. So I stopped, you know, working on that for a while. I did a number of startups doing computer vision, you know, video processing, a lot of that stuff. 

And then when Jeff Hawkins started Numenta back in 2005 with the idea of really deeply understanding how the brain works and figuring out how to apply that to AI, for me, it was like all my worlds coming together. This, like, this is what I had to do. None of us thought [LAUGHS] it would take as long as it did. We spent the last couple of decades really deeply trying to understand neuroscience from a computer scientist—from a programmer’s—standpoint, the underlying algorithms. And that’s really what I’m passionate about, just trying to translate what we understand about the neuroscience to today’s AI.  

And in terms of what we’re working on today, it’s, you know, the human—maybe we’ll get into some of this—the brain is super efficient in how it works—power efficient, energy efficient—and we’re trying to embody those ideas and trying to make AI a lot more efficient than it is today. 

BURGER: Great. I think we’ll get into efficiency a little bit later in the podcast because that’s a subject that’s near and dear to my heart, you know, being a computer architect originally by training.  

I want to go back to, you know, one of the reasons I got involved with Numenta is, you know, Subutai and I have been exchanging emails, like, discussing collaborations, you know, visiting each other through the years, and the thing that really stuck with me was when I read one of the earlier books from Jeff On Intelligence (opens in new tab). And there was an example in the book that talked about how, you know, the human brain learns continuously. I think biological organisms in general learn continuously.  

And the anecdote that I remember was this anecdote if you’re walking down your basement steps, you know, you’re walking down the stairs to your basement and there’s one step that’s always been a few inches off and you decide to fix it, and so you raise it so it’s even with the others, and then the next time you go down the stairs, you don’t remember and you’re wildly off and, you know, you hit that step, you hit it earlier or later than you anticipated, you go out of balance. You’re flailing around. You know, you get all this adrenaline. You think you’re going to pitch headfirst down the stairs. Hopefully you don’t. And then the second time you do it, you’re a little off balance, but it’s not crazy. And the third time you maybe notice a little bit, and the fourth time, it’s, like, it’s your basement stairs. 

And so somewhere between that first time down and the third and fourth times down, there are molecular changes in your brain that have learned the new timing of your basement steps. And I remember just that example vividly from the book. And that got me thinking, wow, this is so different from the way our digital AI works. I’ll turn it over to you to comment for that and then I think we’ll go into the digital. 

AHMAD: Yeah, no, that’s a great example. I think it’s remarkable how our brain is constantly modeling our entire world at such a granular level, and we’re not even aware of it perceptually. Like, you know, that example of the steps is probably not … you wouldn’t consciously be aware of it, yet if something is different about anything in your world that you’re very familiar with, you’ll instantly notice it. And then you’ll, you know, you’ll update your world model, you’ll adjust, and you’ll continue on. It’s really remarkable how the brain’s able to do that so seamlessly. 

BURGER: And a lot of that is based on neurotransmitters, right? Because there’s just a … you know, when you have that physical reaction to “I’m about to pitch down the stairs,” you get a flood of transmitters that actually changes the way your brain’s learning or at least the rate. 

AHMAD: Yeah, there’s a flood of neurotransmitters and neuromodulators, as well, that invoke change, sometimes very rapidly. Another example, you know, if you touch a hot stove—that’s the canonical example—you will learn that very, very quickly. So there’s a lot of chemical changes that happen. But it’s also really interesting that we can update things and update our world knowledge without impacting everything else that we know. This is something that’s very, very different, again, from today’s AI models. We’re able to make these changes in a very contextual and very, sort of, fine-grained way.  

BURGER: So, Nicolò, I want to go and talk a little bit now to transformers. So I think, you know, you and I and Subutai were all working in the AI field, you know, many years before 2017, when the transformer hit. You know, I was building, you know, with my team hardware to accelerate RNNs [recurrent neural networks], LSTMs [long short-term memory], you know, which had this awful loop-carried dependence, you know, the bottlenecked computation, and then the transformer was just much more parallelizable.  

So what do you think’s really going on in these things? And maybe we could start—I know you and I have talked a lot about this—maybe just start with the major blocks. You know, you’ve got the attention layer. You’ve got the feedforward layer. You’ve got, you know, the encoder stack and the decoder stack and the latent space in between. Can you just, kind of, walk us through those pieces at a high level and tell us what you think is going on? 

FUSI: Yeah. Yeah, I mean, I have a very opinionated view of why transformers are so great.  

BURGER: That’s why you’re here. [LAUGHS] 

FUSI: Maybe, like, yeah, maybe I’ll inject it. I don’t know. I don’t think it’s a super novel creative opinion, but it is an opinion. So I guess the two principal … the two main components you already described: the, you know, the transformer [read: attention] layers and the feedforward layers. One way to think about them is, how does information in your context relate to each other and what is every token referring to, for instance, in the case of transformers in language models? 

So by context, we mean, like, the information you feed through the model, that the model keeps continuously generating and appending to. 

BURGER: So like your chat history. 

FUSI: Your prompt. Your what? Your chat history or your particular prompt in a chat session.  

BURGER: OK.  

FUSI: That prompt, which is a sequence of words, gets discretized in a series of tokens. Tokens can be individual words, can be multiple words, kind of, connected together. The way we go from words to tokens typically is through an algorithm that tries to basically collapse as much as possible. Multiple words, like “the dog,” may be just one token as a first, kind of, level of compression to feed into the model. So it just tries to bring things together as efficiently as possible.  

Then there is, you know, within these models, there is a transformer layer. This transformer layer or this attention layer, sorry, tries to basically figure out what the “the” refers to—the term “the” in “the dog,” or “the dog jumps on the table,” “jumps” refers to the dog. So there is this kind of, like, mapping that happens.

And then there is, like, feedforward layers, which in modern large language models, they store a lot of information. Like, that’s kind of, like, where the knowledge typically kind of sits in, the things that the model just knows. You know, that, I don’t know, if you slam your arm against [the] cup of water on your table, that cup of water falls off the table. That’s something that the model, kind of, has baked in through reading a lot about cups falling off of tables when they’re hit. 

So that’s, kind of, those are, for me, the two fundamental components, and the reason why I have an opinionated view is that, you know, honestly, I do believe that RNNs and, you know, even state-space—modern incarnations of state-space models—are good enough to learn over these, you know, language data or whatever or vision data or audio data. 

The good thing about transformers is that they do two things very well. One is they get out of the way. They don’t have this notion of “everything has to be encoded through a state” like recurrent networks. And two, they do that very computationally efficiently as you were saying. There isn’t a computational bottleneck. And so they created this nice overhang where they happen to be the right architecture at the right time to unlock enough flow of information through the model … 

BURGER: Yeah.  

FUSI: … that we could get through these amazing things. 

BURGER: Let me press you on one thing. Like, you know, in the attention blocks, you can figure out which words or which tokens relate to which tokens. So I put in the prompt and it’s finding all the relations and then feeding those relations up to, you know, the feedforward layer—well, the feedforward unit within a layer. And you said that knowledge is encoded there, but then what does it really mean for those maps to then access knowledge, but then you project it back into, you know, the output and then feed it up to the attention block in the next layer?  

FUSI: Again, yeah.  

BURGER: So it seems kind of weird that I’d be, like, accessing knowledge and then taking that knowledge, merging it, and going back to another attention map. 

FUSI: Well, you can see it as a mixing operation that happens in the feedforward part of the layer. You know, like, you’re attending, then you’re mixing, and, kind of, like, reprojecting to some space with higher-information content or, like, a different level of information extraction. And then you’re putting it back into, “OK, so let me do another round of processing” and, kind of, attending and then a mix again. And then I do it again and then I do it again.  

So I think that the information that is present in the prompt and in the, you know, that has been baked into the weights gather further and further refined. Whether that refinement is extraction of structure or aggregation into higher-level concepts, I’m not sure. I think it’s just structure gets extracted and things that are irrelevant get kind of pushed away. But that doesn’t necessarily mean that it gets aggregated through the architecture.  

BURGER: So now I’m going to try to, like, restate what I think I hear you saying. So, you know, we’re adding information and we’re kind of adding information at a higher level but not necessarily throwing away the low-level information, at least that’s not relevant, right?  

FUSI: Yeah. 

BURGER: Because, you know, if the higher-level stuff depends on the low-level stuff, I have to have that first. And so then you get to the top of the encoder block and you’re in the latent space with all of that information kind of maximized. Is that a way to think about it? And if you agree, can you talk about what the encoder block really is and what the latent space is? 

FUSI: I tend to agree, yes. I mean, there is … you’re describing … I think you’re describing what I think is happening, which is there is given the context in your prompt and given the task that the model perceives or, like, figures out that you’re doing, it has to highlight and pull out the relevant information. And it does that not by summarizing layer by layer, but it does it by, you know, increasing the prominence of that information and suppressing other things. So I think that’s ultimately what happens up to the point where you reach this beautiful point in concept space, which identifies both your intent and the things in the prompt and in the knowledge of the model that are necessary to solve it. 

BURGER: And so one last question, and then I want to go to Subutai for a second.  

So now when we go through the decoder stack, are we just going the other way and stripping out the high-level concepts early and then getting down to the granular tokens? Or, you know … because you go up through the encoder stack, those attention blocks and feedforward layers, to get to that magical latent space. And now we’re going to go the other direction. How do you think about that other direction through the decoder stack, which is the same primitives as the encoder stack? 

FUSI: Same primitives. You can think of it as kind of the reverse operation. Like you, you never lost information throughout. You just kind of suppressed or privileged different kinds of information. And now you’re basically just projecting it back out to a space that is, you know, intelligible. And it’s, kind of, where the model gets its … I hesitate to use the term reward because it has a particular implication, but that’s, kind of, where the loss gets computed and then gets pushed back through the model. 

BURGER: Right, as you’re trying to evolve and train all those parameters—the relationship between words, the information in the feedforward layers, the design of that latent space, and the extraction of the knowledge from it. 

FUSI: That’s right. And so in encoder-decoder model, you push through the whole thing, you decode back to a particular token, which for people who don’t know, it’s, like, literally a number out of a vocabulary, like word No. 487. And if it was word No. 1,500, you get, you know, like, … 

BURGER: Something else. 

FUSI: … a bad reward. Yeah. Yeah. And then … and if you got it right, you get a positive signal that then just flows back through the model. 

BURGER: I’d like to go over to Subutai now. So after hearing this, you’ve studied, you know, neuroscience and the neocortex and cortical columns and all of this for a long time, and you and I have had lots of debates. Is the human brain doing something different than that? You know, are we just building latent spaces, then extracting? The architecture is very different, but what’s going on under the hood? 

AHMAD: Yeah, the architecture is very different. You know, as Nicolò was describing what happens throughout a transformer stack, I was trying to relay and relate, you know, what we know in the brain, as well.  

In a typical, you know, transformer model, there is, at the end of the day, there is a single latent space from which the next token is output. That does not happen in the brain. There are thousands and thousands of latent spaces that are, sort of, collaborating together, if you will.  

You know, a lot of what we publish is under the moniker the Thousand Brains Theory of Intelligence. And Jeff has published a book a few years ago on that (opens in new tab). And that, kind of, dates back to discoveries in neuroscience from the ’60s and ’70s by the neuroscientist Vernon Mountcastle (opens in new tab), who was a professor at Johns Hopkins. 

BURGER: Yup. 

AHMAD: And what he discovered … he made this remarkable discovery that, you know, our neocortex, which is the biggest part of our brain—that’s where all intelligent function happens—is actually composed of roughly 100,000 what he called cortical columns (opens in new tab)

BURGER: Right.  

AHMAD: And each cortical column is maybe 50,000 neurons. And there’s a very complex microcircuit and microarchitecture between the neurons in a cortical column.  

But then there’s 100,000 of them, and every part of your brain—whether it’s doing visual processing, auditory processing, language, thought, motor actions—they’re all composed of this, essentially, this same microarchitecture. And this was a remarkable discovery. It says that there’s a universal architecture. It’s not a simple one. It’s complex. But it’s repeated throughout the brain. 

And that’s where this, you know, the idea of the Thousand Brains … each of these cortical columns is actually a complete sensory-motor processing system. It has inputs; it has outputs. It’s getting sensory input. It’s sending outputs to motor systems. And it’s building, in our theory, complete world models. So there isn’t a single latent space. There’s thousands of these latent spaces. 

And each little cortical column is trying to understand its little bit of the world. You know, one cortical column might be getting, at the lowest level, maybe one degree of visual information from the top right-hand corner of your retina. Another one might be focusing on specific frequencies in the auditory range. You know, each one has its own little view of the world, and it’s building its own little world model. 

And then they all collaborate together. There’s no top or bottom here. There’s no homunculus in the brain. Everything is sort of equal. And they’re all simultaneously collaborating and voting and coming up to, you know, what is the, you know, consistent interpretation of all of these sensory inputs that we’re getting? What is the single consistent, you know, concept, if you will, and, based on that, make the motor actions that are most relevant to that. 

So it’s a sensory-motor loop. It’s a, you know, it’s a constantly recurring system; we’re constantly making predictions. As we discussed earlier, you know, we are constantly learning. Every cortical column is constantly updating its connections, constantly updating its weights. It’s building and incrementally improving its world model constantly. So it’s a massively distributed, you know, set of processing elements that we call cortical columns that are, they’re all equal, operating in parallel. 

So I think there are similarities, for sure, between them. But at least the way I described it, I think it’s very different in its operation than what I understand today’s LLMs to be. I don’t know if you agree with that or not. 

FUSI: Yeah, I … To better understand, I had a question, which is, are these cortical columns relying on the fact that these are essentially multiple views of the same process and those multiple views, like, the, you know, the part of the sensory input that gets allocated or subdivided, is it happening at the same time point? So in other words, if you could artificially delay by some time t some cortical columns with respect to the rest, would the learning suffer?  

AHMAD: Yes, absolutely. Yeah.  

FUSI: And so in other words, how important is it that it’s, kind of, on the same schedule? 

AHMAD: [LAUGHS] Yeah, I mean, that’s another … I mean, LLMs today, you know, you get your input, one layer processes it, then the next, then the next, and the other layers are not operating. In the brain, it’s not like that. Everything is operating in parallel asynchronously. And this is important. They’re constantly trying to make predictions and so on. So if you were to artificially slow down some of your cortical columns, you would absolutely suffer. Your thinking would absolutely suffer. 

BURGER: I wanted to interject here just because this is where … this discussion is where, you know, I got super interested in the difference and then spent a bunch of time with Subutai to learn from him. So if I think about my skin, you know, which is an organ, you know, as I understand it, there’s a cortical column attached to each patch of my skin and the size of that patch, kind of, corresponds to the nerve density there.  

AHMAD: That’s right. Yeah. 

BURGER: So in my brain, there is a set of cortical columns that are skin sensors, and I could actually … if I numbered all the cortical columns in the brain, I could draw a map on my skin and say, “This is No. 72 in this patch. This is No. 73 in this patch.” Now are human cortical columns, like, better than, say, what we see in a mouse? And, of course, this is a leading question because I know the answer. 

AHMAD: [LAUGHS] Yeah. So, yes, it, you know, cortical columns in your sensory areas, primary sensory areas, each, you know, pay attention to or get input from a, you know, some patch of your skin somewhere on your body. And there’s many more cortical columns associated with your fingertips than, you know, a square centimeter of your back, for example. So there’s definitely, you know, areas of sensory information that we pay a lot more attention to and devote a lot more physical resources to.  

In terms of a mouse and humans, it’s pretty remarkable that the cortical columns … so all mammals have cortical columns; all mammals have a neocortex. All mammals have cortical columns from a mouse all the way up to humans. And mice have cortical columns that are very, very similar to what a human has. It’s not identical. There are differences. But by and large, the architecture of a cortical column in a mouse is, you know, very, very similar to cortical columns in humans. Human cortical columns are bigger. There are more neurons, and there’s more detail there, but essentially, it’s the same. And …  

BURGER: Maybe just scaled up a little bit.  

AHMAD: Yeah. So evolution basically discovered this structure—that it’s really excellent for processing information and dealing with it—and then through, you know, very fast in evolutionary time, basically figured out that if you could scale up the number of cortical columns, you get more intelligent animals. And that’s what happened very, very fast evolutionarily. 

FUSI: I didn’t know about the unevenness of cortical columns present. Like, this is not … I’m not a neuroscientist, and so this is interesting because one of the biggest frustrations with many modern architectures of models is that they deploy a constant amount of computation no matter what the input is.  

So I go through the same number of layers whether I’m trying to predict the word “dog” after “the” or whether I’m trying to solve, like, give the final answer to a very complicated math question or, you know, whether a theorem was proven or not in the prompt. And so that’s interesting because, like, some current instantiations of modern architecture actually deploy … try to cluster things together such that you have a constant amount of information that you then push together through the model. [LAUGHTER] And so maybe like on my fingertips, I need more processing than I need on my elbow because, like, you know … and so this, kind of, makes sense. 

BURGER: Nicolò is being humble. He was working on this problem two years ago and told me about it. It was one of the things I learned from you that made me think differently. So … 

FUSI: I just like to refer to people who are working on this … [LAUGHS] 

BURGER: Random average people who are not all necessarily brilliant AI scientists.  

So the prediction part of this, though, is really what’s fascinating to me, because, again, something else Subutai and I discussed many years ago, you know, if I’m, like, moving my finger towards the table and…my brain is making predictions because I have a world model. It knows a table is there. And the cortical columns representing that patch of skin, as it’s getting closer, they’re starting to predict that I’m going to feel something that feels like the table. And, yup, there; I hit it. Prediction met.  

But if I touched it and it felt really icy cold or super hot or fluffy or not there—I pass through it—I’d get a flurry of activity because the prediction wouldn’t match the world model, and that’s where learning would happen.  

Subutai, does that sound like the right model and intuition?  

AHMAD: Yeah, that’s definitely a very important component of it. We’re constantly making predictions. And as you said, you know, you’re moving your right fingertip down; you know, perhaps you’ve never sat in this room before or, you know, seen this table before, you would still have a prediction, a very good prediction of it. 

BURGER: Yeah. Because you know what a table is. 

AHMAD: You know what a table is. And if it was different, you would, you know, you would notice it right away. But if your left hand, which you weren’t paying attention to, also felt icy cold, then you would notice that, as well. So you’re actually making not just one prediction; you’re making thousands and thousands of predictions constantly about … 

BURGER: Every cortical column. 

AHMAD: Every cortical column is making predictions. And if something were anomalous, highly anomalous, you would notice it. So this is something, you know, we don’t often realize; we’re making very, very granular predictions constantly. And when things are wrong, we do learn from it.  

And the other interesting thing—and this is, again, possibly different from how LLMs work— you know, if I were to tell you to touch the, you know, the bottom surface of the table, you could without, again, without looking at the table or opening your eyes, you would be able to move your finger in and touch the bottom of your table because you have a, you know, set of reference frames that relate to …  

BURGER: Yup … 

AHMAD: There you go. Yep. You’re able to do it. 

BURGER: I did it! Yeah. Amazing. 

AHMAD: Even though you maybe never have been in this room; maybe you’ve never seen this table before. It doesn’t matter. 

BURGER: I’ve been in this room because we had to prep for the podcast series. But I didn’t touch the underside of the table, that’s for sure. [LAUGHS] 

AHMAD: Yeah, exactly. [LAUGHS] So, you know, we know where things are in relation to each other, where our body is in relation to everything, and we can very, very rapidly learn. And again, if the bottom part of the table was anomalous, you would notice it and potentially remember that. 

FUSI: I’m not going to lie. I was expecting you to find something under that table, [LAUGHTER] like a talk show. 

AHMAD: Or chewing gum or something. 

FUSI: And if you reach under the table, you’re going to find a copy of my paper. [LAUGHS] 

BURGER: [LAUGHS] You know, if I was smarter and better prepared, that’s exactly what would have happened. But, sorry, guys.  

I think you told me something, Subutai, you know, that … and I’ll give a little bit of preamble.  

So, you know, the brain has these dendritic networks in each neuron, and they form synapses. And so a neuron fires, and that, you know, the axon of the neuron that’s firing will propagate a signal through the synapses, which might do a little signal processing to the dendrites of the downstream neurons, and those downstream—the dendrites can then prime the neuron to fire. That’s one of the fundamental mechanisms. And it’s the formation of those synapses, you know, between the upstream and downstream neurons, the dendrites, that seem to be the basis of learning, and to me, that feels a little bit like an attention map. 

AHMAD: Yes.  

BURGER: So maybe the dendritic network is doing something akin to self-attention, and we have some work going on in that direction at MSR. But the thing you told me was that your brain is actually forming an incredibly large number of synapses speculatively. In some sense, sampling the world when something happens in case it will recur. You know, it’s a more … maybe it’s a version of Hebbian learning, right? You know, things that fire together, wire together. 

AHMAD: Exactly. 

BURGER: But then if that pattern doesn’t recur, then they get pruned. And I’m just going to ask, you know, what is the fraction of your synapses that get turned over every three or four days, you know, ballpark? 

AHMAD: OK. Yeah, I remember this. This was an absolute mind-blowing study in [The Journal of] Neuroscience (opens in new tab). So, you know, the way a lot of learning happens in the brain is by adding and dropping connections. 

In AI models, it’s usually strengthening, you know, high-precision floating-point number, making it higher or lower. But you’re not adding and dropping connections. The connections are always—in fact, everything is fully connected, right, between layers. And so in the brain, you’re always adding and dropping connections. That’s a fundamental mechanism by which we learn, one of the fundamental mechanisms.  

What I read in this study is that they looked at adult mice and adult animals, and what they found is that they would look at the number of synapses that were connected over the course of a couple of months—and they were able to trace individual synapses in this particular part of the brain—and what they found is that every four days, 30% of the synapses that were there were no longer there four days later. And there was a new 30%. And there’s a huge number of connections that are constantly being added and constantly being pruned. And my theory of what’s going on there is that we’re always speculatively trying to learn things. 

So, you know, there’s all sorts of random coincidences and things that we are exposed to on a day-to-day basis. We’re constantly forming connections there because we don’t know what’s actually going to be required and what’s real and what’s random. Most of it’s random; most of it’s not necessary. And the stuff that actually is necessary will stay on. But we’re constantly trying to learn. 

This is a part of continuous learning that’s often not appreciated, I think, is that we’re constantly forming new connections, and then we prune the stuff that we don’t need. In an AI model, if you were to do that, it would just go, I don’t know, it would go bananas. [LAUGHTER]  

BURGER: Well, so let’s double-click on that. So when you told me that, the way I … 

AHMAD: This is mind-blowing, this 30%.  

BURGER: It’s crazy.  

AHMAD: Your brain is going to be totally different a few days from now. 

BURGER: It’s so mind-blowing. When you told me that, I spent some time processing it, so a whole bunch of synapses were created and destroyed during that time.  

But it just made me think that we have, you know, we have all of these columns getting all of this input continuously. You know, eyes, hearing, smell, taste, skin, heat, and then, you know, interactions with people, and then planning and experiences, just at every level. And they’re constantly sampling all this noise coming in and basically filtering out the noise. It’s like, kind of, like a low-pass filter. But when something statistically significant recurs, it’s going to lock and then become persistent.  

AHMAD: Yeah, yeah, I think so. There’s so much that’s happening, and you’re constantly learning, and, you know, when you touch a hot stove or something, there’s a flood of dopamine specific to those areas that caused these synapses to strengthen very, very quickly. You know, most of these synapses that are learned are very, very weak synapses.  

BURGER: Yup. 

AHMAD: And so, yeah, you know, when you look … in this study, they also quantified the turnover in, kind of, strong synapses versus weak synapses. And it’s comforting to know that the strong synapses stay there. It’s really these weak synapses that are constantly added and dropped. And then some of them will become strong. 

BURGER: Now I want to go back … return to Nicolò, but with an observation.   

So when I’m training a transformer, it’s also a prediction-based system. You know, I’m running … I have my input in the training set; I have my masked token or the next token I’m trying to predict. I run it through. I look at how successfully did it make that prediction, and the worse it was, the, sort of, the steeper the error, you know, I drive back through the network. So, you know, if it’s spot-on, I don’t learn very much. But if the prediction is way off, I’ve got to change a bunch of stuff. That sounds analogous to what Subutai was just describing with the cortical columns. 

FUSI: No, that’s right. I mean, it connects with, I don’t know, one big pet peeve of mine in pretraining, in particular around pretraining these language models.  

BURGER: OK. 

FUSI: So again, for context, like, language models in particular, but, you know, many other instantiations of large models, are trained in a few phases usually. One of them is pretraining, where you have some ground truth text and you remove, let’s say, just the last word, and then you ask the model to predict the last word. And that’s when you get that loss. Do you get the word right? Do you get the word wrong?  

One of the big problems that I have is that, you know, in human experience, we do not get feedback on every single thought.  

The problem with language models, the way we are training them, at least in pretraining, is that they do a thing called teacher forcing. So they guess the word, then they immediately get the signal, and then the right word gets filled in, and then they predict the next one. 

So when you go through, like, a passage of text, you constantly get this reward. And it’s such a bizarre way to train a model. It’s necessary because you want a lot of flow of supervision. Like, you want, like, a lot of supervision to essentially use all the computation available. But at the same time, it actually makes the models arguably a little bit worse than what they would be if you had enough compute to train them without this. 

I went on a tangent just because it’s a pet peeve. [LAUGHS]  
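
[EDITOR’S NOTE: A minimal sketch of the teacher-forcing setup described above, assuming a toy PyTorch model. It illustrates the general recipe only, not any particular model’s training code: the network is always conditioned on the true prefix and receives a loss signal at every single position.]

```python
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
embed = nn.Embedding(vocab_size, d_model)
backbone = nn.LSTM(d_model, d_model, batch_first=True)  # simple stand-in for a transformer decoder
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (1, 16))      # one toy token sequence
inputs, targets = tokens[:, :-1], tokens[:, 1:]     # predict position t+1 from positions <= t

hidden, _ = backbone(embed(inputs))                 # always conditioned on the TRUE prefix,
logits = head(hidden)                               # never on the model's own earlier guesses
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)                                                   # a supervision signal at every position
print(loss.item())
```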

BURGER: It’s a really important point, though, because your goal when you’re training a model is to get to your loss target with the minimal cost and time. Or, of course, like, fixed budget and, like, lowest loss target. 

But, you know, biological systems, also, their goal is survival with energy minimization. And so, like, once you’ve built a world model that works, right, like touching the table, touching the underside of the table—nope, still nothing exciting there—like, it takes very little energy to do that. And I think a tragedy is that we all have these supercomputers in our heads. You know, the neocortex is what, about 10 watts? And it’s this amazing thing, right, that can compose symphonies. But once we have a world model, a lot of us just stop learning because it’s comfortable, right. You don’t have to perturb the state. You can go through … and, you know, I mean, how many of us go through every day and all of our predictions succeed [LAUGHTER], and there’s no surprises, you know?  

So all the new synapses get swept away, right. That’s not a goal of pretraining because then you’re just wasting energy. But we’re trying to minimize energy consumption. So it does feel, kind of, aligned to me in some sense. 

So I’ve got a straw man I want to hit you with, but before we do, Nicolò, I want you to talk about your view on compression, like LLMs as compressors, because I know this is something you’re very passionate about and opinionated about. And I’ve learned a lot from you on this, too. 

And then, Subutai, after this, I’d like to hear your biological response. I mean, your response from a biological perspective. [LAUGHTER] And …  

AHMAD: You’ll get both.  

BURGER: That’s right, of course. And then I want to try … I want to throw out this hybrid straw man. So, Nicolò, tell us about compression. 

FUSI: The view is that basically the generative models are compressors in an information theoretic sense, and so trying to come up with a better generative model is equivalent to trying to find the best compressor for some data. And … 

BURGER: Now when you say compressor, do you mean lossless or lossy? 

FUSI: I mean lossless.  

BURGER: OK. 

FUSI: You can basically look at literally my much-maligned objective function that you use for pretraining, which is, you know, next-token prediction, and you can basically draw a complete parallel to what you would do if you were trying to come up with the, you know, try to do compression, which is coming up with the shortest possible code for something that you’re trying to compress. 

And so the two things are the same, and it, kind of, fits into a broader picture that, you know, like, goes back to Occam’s razor and Kolmogorov complexity and Solomonoff’s principle of induction, which is, you want short descriptions for likely things that happen in the world and you want your algorithm that produces those short descriptions to be also short. That’s the minimum description length principle.  

And I do feel like it fits in, kind of, also what you were saying about the concept of you have a good world model, why look for surprise? Because it simultaneously affects both terms, both the algorithm, like your own world model, but also the loss that you incur when something unexpected happens. 

And so if I’m an agent in the world trying to minimize the minimum description length of the world, I’d like to go and seek some in-distribution data such that I don’t bump up my surprise term too much. 

BURGER: Right. And I think you said at some point that, you know, when I’m training a model, even though you took the same loss point, you know, between Model A and Model B, if I have a steeper loss curve in Model A than Model B, you know, it’s getting to a better, sort of, compression-based vocabulary faster, which makes it more general. The shape of that curve matters from a compression perspective. 

FUSI: Yeah. I mean, I think it would help here to expand on what I was talking about in terms of, … 

BURGER: Yes. Please.  

FUSI: … like, minimum description length principle. The minimum description length principle is basically the loss of the model you’re training; that’s one component. And so it’s a sum over the mistakes you make at predicting or, you know, the mistakes you make at predicting each word. And that’s one term. And the other term is how long it takes you in code to describe the model and the training procedure, … 

BURGER: Right. 

FUSI: … to get to that training curve, to produce that training curve.  

BURGER: Right. 

FUSI: So, yes, if you look at it collectively, one term is, kind of, fixed. It’s the amount of code it would take you to write out a language model, for instance, in code. Like, literally implement it, not the weights, just implement the initialization of it and then the training loop. And then on the other side, you have this training loss that gets generated as you start observing data. And, of course, because it’s a sum, you want to minimize really the area, like, you want to minimize the sum. And so, like, a flatter curve is much better than, like, the steeper curve, you know, even if it ends up at the end to be slightly better. 

BURGER: Yeah. Concave is better than convex. 

FUSI: Among other things, yes. [LAUGHTER] 
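
[EDITOR’S NOTE: A rough sketch of the two-term minimum description length idea discussed above: a roughly fixed cost for writing down the model and its training loop, plus the cumulative next-token loss, converted to bits, that accumulates while streaming the data. The numbers below are invented purely for illustration.]

```python
import math

def description_length_bits(model_code_bits, per_token_loss_nats):
    # Term 1: bits to write out the model and training loop (roughly fixed).
    # Term 2: cumulative next-token loss over the data stream, converted from nats to bits.
    data_bits = sum(loss / math.log(2) for loss in per_token_loss_nats)
    return model_code_bits + data_bits

# Two hypothetical loss curves that end at almost the same value. The curve that
# spends more of its time at low loss accumulates a smaller sum, so it yields the
# shorter overall description of the data stream.
curve_a = [4.0, 1.5, 1.0, 0.9, 0.9]
curve_b = [4.0, 3.5, 3.0, 2.0, 0.85]
print(description_length_bits(10_000, curve_a))
print(description_length_bits(10_000, curve_b))
```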

BURGER: Sorry. So, you know, I think that we could do a whole episode on this compression view because it’s really fascinating. And the lossless part of it is what blew my mind. And I think, you know, I’m guessing there are multiple camps here, and you’re squarely in one camp, so I’m guessing we’ll get a bunch of feedback from the other camps. 

So, Subutai, you know, can I think of cortical columns as compressors? 

AHMAD: Yeah, it’s a good question. You know, I, you know, there’s so much in the compression literature that you can draw insight from. You know, if you look at the representations in cortical columns and the population activity that neurons have, you know, some of the things you have to deal with are that the brain doesn’t have a huge nuclear power plant attached to it. 

You know, we only have 12 watts or so to process everything we want to do, and the representations that evolution has discovered are incredibly sparse. And what that means is that you may have thousands and thousands of neurons in a layer, but only about 1% of them will actually be active at a time. And so it’s a very small subset of neurons that are actually active.  

I don’t know about this minimum description length, whether that applies. I can say a couple of things about that. There’s, you know, by and large, the representations are very sparse when you’re predicting well. When you see a surprise, there’s a burst of activity.  

BURGER: Yup. 

AHMAD: When there’s something that’s unusual, there’s a lot more neurons that fire, and … 

BURGER: That’s why learning is tiring!  

AHMAD: That’s why learning [LAUGHTER] … exactly. No, no, that’s right, that’s right.  

And so what we think is happening is that, you know, the actual representation of something is a very small number of neurons. When you’re surprised, there may be many things that are consistent with that surprise, and so your brain represents a union of all of those things at once. 

And when you have a very sparse representation, you can actually have a union of many, many different things without getting confused. So that’s what we think is going on there. So it is a very compressed, very efficient representation. And because it’s such a small percentage of neurons that are firing, we are very, very parsimonious in how we represent things and extremely energy efficient metabolically. 

BURGER: I wanted to get to the efficiency point, but before I do, you know, you talk about this 1, you know, 1 to 2% of the neurons firing. But it’s, actually, the brain is actually much sparser than that at a fine grain, right?  

AHMAD: Yes, yes.  

BURGER: Because, you know, you have 1% of the neurons firing, but they aren’t connected to all the other neurons in the region. 

AHMAD: That’s right. Yeah. 

BURGER: So really the sparsity should be the product of the connectivity fraction times the activity factor. 

AHMAD: Yeah. Yeah. 

BURGER: Right. That’s about one out of 10,000. Something like that. 

AHMAD: Exactly. Yeah. So something like maybe 1% of the neurons are firing at any point in time, and maybe 1% of the connections that are possible are actually there at any point in time. So it’s a very, very small, you know, subnetwork through this massive network that’s actually being activated, a tiny percentage of neurons going through a very, very tiny piece of the full network. 

You know, it’s common to, you know, some people say, “Oh, we’re only using 1% of our brain.” That’s not true. It just means at any point in time, you’re only using 1%, but at other points in time, a different 1% is being used. So, you know, the activity does move around quite a bit. But at any point in time, it’s extremely small. 
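
[EDITOR’S NOTE: The back-of-the-envelope arithmetic behind the sparsity point above, using the ballpark figures from the conversation rather than measured values.]

```python
active_neuron_fraction = 0.01        # ~1% of neurons firing at any instant
existing_connection_fraction = 0.01  # ~1% of possible connections actually present
effective_fraction = active_neuron_fraction * existing_connection_fraction
print(f"Fraction of the full network engaged at any moment: ~{effective_fraction:.0e}")
# Prints ~1e-04, i.e. roughly one part in ten thousand.
```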

BURGER: So, OK, the sparsity, I think, you know, the representation—how the brain is doing this compression biologically—is super fascinating. And I want to go on a little bit of a detour now to efficiency. So I remember in 2017 when in MSR we were building, you know, hardware acceleration for RNNs. 

And then the transformer hit, and they were optimized, you know, to be highly parallelizable across this quadratic attention map for GPUs. The way I would describe it is that that transition to semi-supervised training moved us from an era when we were really data limited, like you had to have good high-quality labeled data, to an era when we were compute limited.  

And when that transition happened, we hockey-sticked from, “I’m building faster machines but I’m limited by data” to the bigger machine I can build, as long as I have enough, you know, unlabeled data of high quality, the better I can do with the model. And so we went on the supercomputing arms race, and now we’re building these, like, just gargantuan machines. 

And really, we’ve kind of been brute-forcing it. I mean, we’ve done a lot of things to optimize, like quantization, you know, and others, and, you know, a better process node, you know, a better, more efficient tensor unit design. But to first order, we’ve been training bigger models by building bigger systems.  

And I just wonder, do you think that the brain at this 10 to 12 watts in the neocortex just has a fundamentally more efficient learning mechanism? Or do we think that, you know, what we’re doing in transformers in the most advanced silicon is as efficient, we’re just building much larger, more capable models? 

AHMAD: Oh, I think without a doubt, transformers are extremely inefficient and very, very brute force. We touched on this a little bit earlier in the attention mechanism, where we’re, you know, transformers are essentially comparing every token to every other token. I mean, there are architectures which reduce that, for sure, but it’s essentially an n-squared operation. And we’re doing this at every layer. 

I mean, there’s nothing like that in the brain. Our processing, you know, in some sense, the context for the very next word I’m about to say is my entire life, right? And the amount of time I take to produce the next word doesn’t depend on the length of the context at all. It’s constant time with respect to the context. 

So it’s a significant, you know, reduction in the compute that’s required. You can kind of think about, like the brain—I think has somewhere around maybe 70 trillion synapses. When I say the brain, I mean the neocortex, has about 70 trillion synapses. And it’s using only 12 watts. And a synapse is roughly equivalent to a parameter. 

And if you were to take the most efficient GPUs today and try to run a 70 trillion parameter model, it would be something like a megawatt of power. It’s tens of thous … it’s orders of magnitude more inefficient than what our brain is doing. So I absolutely believe that. 
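
[EDITOR’S NOTE: The comparison above, written out with the ballpark numbers quoted in the conversation. These are order-of-magnitude illustrations, not measurements.]

```python
neocortex_synapses = 70e12  # ~70 trillion synapses, per the estimate above
neocortex_watts = 12        # ~12 W
gpu_model_params = 70e12    # a hypothetical 70-trillion-parameter model
gpu_watts = 1e6             # ~1 MW of GPU power, per the estimate above

brain = neocortex_synapses / neocortex_watts
gpus = gpu_model_params / gpu_watts
print(f"Neocortex: ~{brain:.1e} synapses per watt")
print(f"GPUs:      ~{gpus:.1e} parameters per watt")
print(f"Gap:       ~{brain / gpus:,.0f}x")   # several orders of magnitude
```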

BURGER: The metric I use, to go back to your point, you know, is, this is something, I think we talked about this back in the day, right? When, you know, after this kicked off for a few years, we were trying to project, like, how far would this go under the current model to inform the research and the directions you took. Which is why I got so interested in sparsity and working with you.  

And we would look at a training run and just say, how many joules did it take to train the whole model? How many parameters do we have? And sort of what’s our parameters per joule? And, by that metric, you know, we were off by many orders of magnitude from where the brain is, but I don’t know that that’s the right metric. So any thoughts on that? 

AHMAD: Yeah. I mean, in some ways, you know, transformers, you know, embody more knowledge in them than any human has.  

BURGER: Right.  

AHMAD: It has memorized, you know, the entire internet’s worth of knowledge, essentially. 

BURGER: All scientific papers … 

AHMAD: All scientific papers. You know, good and bad, whatever, you know, it has memorized everything. So that’s something that, you know, humans just cannot do. So there’s definitely stuff that’s better in transformers than humans.  

But fundamentally, I think, you know, we’re extremely efficient in how we process the next token or the next bit of information that’s coming in. And I think there’s a lot we can learn from the brain and apply to LLMs and future AI models there. 

FUSI: I was going to ask a question related to that because … forget memorizing the internet. But let me give you another example that transformers do really well. And I’m wondering, like, you know, the human aspect of this or the brain aspect of this because transformers, because of the n-square computation, they’re really good at stuff, like a needle in the haystack. 

So I can tell you right now, I can speak, I can talk to you, and I can tell you the password is something silly like “podcast microphone blue,” whatever. That’s the password. And then I can proceed and read the entire Odyssey or a bunch of other books to you out loud for the next 5 or 6 hours. And then I can ask the transformer, what was the password? And the transformer will do this nice n-square computation many times, and it will spit out the password.  

A human, you know, there will be a decay of that password. And then at some point, it won’t remember, and depending on the human, it may be in the first chapter of the Odyssey or like at the end, but … so fundamentally the type of computation that is done is very different. So it always makes me wonder about the efficiency because it’s just, like, it’s a different type of computation. So the efficiency of … like, efficiency is kind of like, what are you doing divided by how good are you at doing it. And so when the things we’re doing are so incomparable in many ways, that always makes me … always troubles me a little bit. I don’t know… I don’t know if there’s any question in there. [LAUGHTER] 

AHMAD: Yeah. I mean, transformers can do the stuff that humans find very, very difficult to do. Absolutely. You know, maybe there’s a way to get the best of both. I don’t know. You know, I don’t know that it’s fundamentally necessary to have such brute-force computation to get all of these features. 

FUSI: That’s right. 

BURGER: Yeah. Yeah, it is a weird thing because, you know, this is why memory palaces work so well. Like, there is a way, though, for a human to remember that my microphone is gray. It’s not actually blue, Nicolò. 

FUSI: Mine is blue. You don’t see it. It’s off camera. You see, your world model …  

BURGER: It’s off camera. Yeah, I know. I was just teasing you.  

But there’s a way, like, if I can just connect it to enough things, get that connectivity graph, then I’ll remember it because it’s captured the signal out of the noise and connected to enough things I can retrieve it. And retrieval would be a whole other topic we don’t have time to get into today.  

But I do … now, I want to go to the straw man. So let’s take continual learning off the table. Let’s imagine that, as I go through my day, I’m just saving all of the sensory data to put in my training set. And now imagine that I take 100,000 little transformer blocks, and I’m training them each with what they’re seeing. 

OK, I replay the day so I don’t have to, again, I don’t have to worry about continuous learning and whatever cross-cortical column, you know, routing feature of the outputs, the inputs, and there’s—Subutai, we’ve talked about this—there’s a complex set of wiring there to bring features from here to there that gets learned. If I replicated that, could a transformer block kind of do what the cortical columns are doing?

Could I just instrument all my sensory patches with little transformer blocks and then wire them up in the right way and have it work? 

AHMAD: I think there’ll be … there’s still a couple of things we need. One is that cortical columns are fundamentally sensory motor. And so they’re actually, each one, each cortical column is initiating actions, as well. So you cannot have a static dataset fundamentally ahead of time. It’s always dynamic because we’re constantly making movements to get the next bit of data. And so … 

BURGER: Couldn’t I tokenize that, though? 

AHMAD: I mean, you could tokenize the input and you can tokenize the output, but, you know, if you were to play the same set of inputs back again to a network that … a cortical column that’s randomly wired differently, it may make a different set of actions. And so as soon as it makes the first action that’s different, that dataset is no longer valid, right? It’s, you know, there is … you can’t fundamentally … you have to have a simulation of an environment rather than a static one-way dataset, if that makes sense.  

So I think that’s one piece that I think’s missing in transformers today, is this, sort of, sensory-motor loop. And then the other piece we talked about is continuous learning. 

BURGER: Yeah. 

AHMAD: I guess you said take it off the table, but … 

BURGER: It’s fundamental.  

AHMAD: Fundamental … different. Yeah, yeah. And maybe one other difference. We talked, you know, much earlier about a single latent space and the prediction that’s being made at the top of the transformer, where you compute the loss function, and that’s back-propagated through the transformer. That’s not how neurons learn. Neurons are making … every neuron is actually making predictions, and every neuron is getting its input. 

And it’s learning independent of anything that happens at the top. And so it’s a much more granular learning signal. And information does flow from the top to bottom. But there’s also many, many other sources of information that it’s learning from. So it’s different in that sense, as well, mechanistically.  

BURGER: The reason I ask, and now I’d like to get into, you know, some of the … the fun speculation because I’ve just … it’s been a phenomenal discussion with the two of you. I think we’ve kind of elucidated the differences. Something I’ve wondered after I’ve talked to both of you … and, you know, Nicolò, kind of learning about this compression view of the world, lossless compression, and, Subutai, just, you know, the Thousand Brains Theory and these cortical columns and the sampling of, you know, the world to capture the signal that you can learn from. 

So let’s say that I was able to design a really small, efficient digital cortical column. Maybe it’s transformer-based with some, you know, a sparse representation and some sensory-motor mechanism built in. Maybe it’s more dendritic-based, you know, mapped into digital hardware. And I put a cortical column on every sensor I have in the world, associated with every person, and wire them up together with some of this and then have a, you know, billions of them that can form higher-level abstractions. Like, what do you think would happen? What could we do? 

AHMAD: That’s a fantastic thought exercise, I think [LAUGHS]. You know, again, assuming the cortical column is faithful and can generate, you know, or suggest motor actions, as well. I mean, in some sense, you could potentially have a super intelligent system, right, that’s far more intelligent than anything else on the planet.  

Now we’re scaling the number of cortical columns, you know, not from a mouse, you know, to a hundred thousand columns that a human might have, but potentially billions of cortical columns and way more. And there’s no reason to think there’s any fundamental limit there. So this sort of a system is, I think, the way that superintelligent systems will eventually be built.  

BURGER: But this is a very different direction … 

AHMAD: It’s a very different … 

BURGER: … than the one we’re currently headed down with, like, these monolithic models where we’re doing tons of RL, you know, to capture, you know, to get high-value human collaboration in distribution. 

AHMAD: Yes. It’s completely different than the direction we’re proceeding.  

So I think they, you know, to go down that path, there needs to be a fundamental rethinking of some of our assumptions, potentially even down to the hardware architectures that are necessary to implement it. The, you know, fundamental learning algorithms, the fundamental training paradigm. We talked about, you know, you can’t have a static dataset. You’re constantly moving around in the world and doing things. So it’s a very, very different way of going about AI than what we’re doing today. 

BURGER: Sounds like a great time to be an AI researcher. 

AHMAD: Absolutely. [LAUGHTER] 

BURGER: Nicolò, what was your reaction to that hypothesis? 

FUSI: It sounds super interesting. I mean, my brain was churning. You know, my background is very different. And so, like, I’m in a much worse position to answer this question. But I was starting to think, OK, so let’s say I do this. What would be my loss function? What, you know, how would information flow through the system? Like, sounds like cortical columns would each have their own loss that then I would aggregate—and then I would add a contribution that is, like, higher level. 

And then back to my question. You know, how is the temporal information coordinated? Because one way to see this is that, you know, the way I’m coming to understand this is that it’s kind of like a multi-view framework. 

You have the same phenomena represented across multiple independent, but simultaneous, views. And so part of me feels like you need to tie together these cortical columns in such a way that they all get that gradient feedback if you’re training with gradient-based methods, for instance. And so that’s, kind of, it feels super, super interesting. 

It is related to a lot of, you know, very superficially, to a lot of ideas in machine learning around, hey, is it better to have one giant super deep network? Is it better to have a bunch of shallow networks? But the difference is also in the way you train them, right? We typically train this bunch of shallow networks on kind of the same objective and the same data and not typically into an experiential cycle. Whereas this sounds like this is a different way to do it.  

BURGER: Right, right. I think … I want to pull this back around to the title of the podcast. And so I’ll share an observation. You know, so I’ve been using some of the latest models to code. You know, they’re getting better really fast. I’ve been using them to kind of relearn some of the physics that I never really understood deeply. 

You know, especially in general relativity, like E=mc². Like, why is c in there at all, right? Just stuff like that. Because now it can actually explain it to me, and I can keep beating at it until I understand it, and then, of course, work. 

And at some point, I asked the model, “Can you describe how I think?” And I was just curious. And it, you know, it gave me a page-long description that made my jaw drop because I said, this thing knows me better than I know myself. I don’t think any human being, including me, could have captured, kind of, the way my approach to learning and my brain works, and I just read it as, like, yep, that’s right. And I learned something about myself.  

So I wouldn’t say that it passed the Turing test because this is way beyond the Turing test. This was like, this thing knows me way better, you know, than I thought any machine ever could. I mean, I’m having a conversation with it. It could be human, but it’s superhuman. So in some sense, it’s, like, intelligent beyond human capabilities with its ability to discern patterns in how someone’s interacting. And yet it’s a tool. You know, it’s not conscious. It doesn’t have agency, embodiment, emotion. It understands a lot of that stuff from the training data. But at the end of the day, it’s a stochastic parrot, right? It’s got, you know, it’s got the weights, and I give it a token, and it outputs a token. So, like, are these machines intelligent or not?  

FUSI: I’ll let Subutai answer first. [LAUGHS] 

AHMAD: OK. You know, you know, it’s definitely a savant, right? It knows a huge amount about the world. It’s absorbed a lot of stuff, and it can articulate that in ways that are just amazing. And, you know, it’s taken your chat history with, you know, presumably thousands of chats and been able to summarize that in a way that’s remarkable. 

At the same time, I think, you know, transformers are not intelligent in the way that a three-year-old is, right? A three-year-old human is very curious, is constantly learning. It can learn almost anything. And, you know, a three-year-old Einstein was able to learn and eventually come up with theories that shook the world. That, you know, E=mc². 

And so, you know, could a transformer do that? I don’t think so. And so I think there’s still a difference. There’s things it can do that are amazing. But there are still basic things that a child can do that transformers cannot do. So I think there’s still a gap there. Exactly how to articulate it, and how to bridge that gap, is, of course, the trillion-dollar question. But it is bridgeable. And there is a gap today. 

BURGER: Right. Nicolò? 

FUSI: You know, I think, from my perspective, they are intelligent. And from my perspective, I go back to the definition of intelligent, which is like, can you achieve your objectives in a variety of environments? It’s a very basic fundamental, but it’s kind of, you know, it can be embodied, a form of embodied intelligence, an agentic intelligence. If I plop you in an environment, and I give you an objective, can you achieve it? And the wilder the environment, the harder the task is.  

And I do think … I agree with Subutai. Like, there is a jaggedness of intelligence we keep describing.  

BURGER: Yup. 

FUSI: Like these things cannot be simultaneously super good, you know, Olympiad-level mathematicians and still give you stupid answers when you’re trying to, I don’t know, you know, figure out which cable goes where in your … in your car’s battery, you know, like, whatever. 

BURGER: [LAUGHS] Well, then it’s better than me. I’m not an Olympiad-level mathematician, and I do stupid stuff all the time. 

FUSI: I know, exactly. Well, you know, whatever that was, that was a bad example. But you get it. But part of it goes back to the compression view. Like, I do believe that intelligence is compression. So the ability to come up with succinct explanations for complex phenomena, and even succinct explanations for complex worlds, implies or leads to your ability to operate within them. And the fact that we have these things that can prove crazy theorems but at the same time fail at fairly rudimentary tasks is a sign that, yes, transformers are great in terms of the inductive biases they put on the world and the computation they do, but we’re ultimately all subject to the No Free Lunch Theorem (opens in new tab). 

You know, across the world, there’s a whole set of tasks that you could be pursuing. You know, you have certain inductive biases that kind of privilege certain tasks at the expense of others. And there isn’t, like, a thing yet that has expanded our set of tasks that are addressable. And so I do think that it’s a matter of rethinking our approach to a few things, which I think is likely both on the architecture front and on the losses and the way we train these systems. I think there is an opportunity to expand the intelligence frontier of these models. But yeah, from my perspective, they are intelligent already, just in a jagged way. 

BURGER: It’s such an interesting question, and I know a lot of people write a lot about this, so I don’t think treading any new ground here. But, you know, there’s the diversity of the tasks you can excel at. You know, are you able to handle nuance and understand things deeply? Are you able to learn continuously? Right now, the systems can’t, right. Are you embodied? I don’t know if that matters. Do you have an objective? Well, we could give them one. Are you conscious? Is that … I mean, that’s a whole other thing.  

So it just feels like there’s a bunch of check boxes, and we’ve checked a bunch of them, and a bunch of them are unchecked. And maybe there’s no consensus on, like, where that threshold is because there are many dimensions of intelligence, and some of which humans don’t even have. 

FUSI: And that’s why we have the terms AGI and ASI, and people are debating the G and the S—what is general, what is specialized. So there is, like, it’s a huge discourse, like, for sure. But that’s why we had to start characterizing. But if you go back to the definition, going back to my schooling, go back to the definition of intelligence from Plato and Aristotle and Descartes, like, in some sense, you see the goalpost moving through the centuries around what we define as intelligent.  

BURGER: Right.  

FUSI: And I feel like we are still doing it. 

BURGER: Yeah. We’ll be doing it for a long time, you know, which in AI velocity is probably another like four or five years.  

Hey, I just want to thank you both for the dialogue. You know, I treasure both of you as, you know, intellects and scholars and friends. It was just a joy to nerd out with you all. So thank you both for taking the time. 

AHMAD: Thank you so much, Doug, for having me.  

FUSI: Thank you for having us. This was great. 

[MUSIC] 

STANDARD OUTRO: You’ve been listening to The Shape of Things to Come, a Microsoft Research Podcast. Check out more episodes of the podcast at aka.ms/researchpodcast or on YouTube and major podcast platforms. 

[MUSIC FADES] 

The post Will machines ever be intelligent?  appeared first on Microsoft Research.

]]>
Systematic debugging for AI agents: Introducing the AgentRx framework http://approjects.co.za/?big=en-us/research/blog/systematic-debugging-for-ai-agents-introducing-the-agentrx-framework/ Thu, 12 Mar 2026 16:38:45 +0000 http://approjects.co.za/?big=en-us/research/?p=1163539 As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency. When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or […]

The post Systematic debugging for AI agents: Introducing the AgentRx framework appeared first on Microsoft Research.

]]>
Three white line icons, showing network, workflow, and bug‑analysis icons, on a blue‑to‑purple gradient background.

At a glance

  • Problem: Debugging AI agent failures is hard because trajectories are long, stochastic, and often multi-agent, so the true root cause gets buried.
  • Solution: AgentRx (opens in new tab) pinpoints the first unrecoverable (“critical failure”) step by synthesizing guarded, executable constraints from tool schemas and domain policies, then logging evidence-backed violations step-by-step.
  • Benchmark + taxonomy: We release AgentRx Benchmark (opens in new tab) with 115 manually annotated failed trajectories across τ-bench, Flash, and Magentic-One, plus a grounded nine-category failure taxonomy.
  • Results + release: AgentRx improves failure localization (+23.6%) and root-cause attribution (+22.9%) over prompting baselines, and we are open-sourcing the framework and dataset.

As AI agents transition from simple chatbots to autonomous systems capable of managing cloud incidents, navigating complex web interfaces, and executing multi-step API workflows, a new challenge has emerged: transparency.

When a human makes a mistake, we can usually trace the logic. But when an AI agent fails, perhaps by hallucinating a tool output or deviating from a security policy ten steps into a fifty-step task, identifying exactly where and why things went wrong is an arduous, manual process.

Today, we are excited to announce the open-source release of AgentRx (opens in new tab), an automated, domain-agnostic framework designed to pinpoint the “critical failure step” in agent trajectories. Alongside the framework, we are releasing the AgentRx Benchmark (opens in new tab), a dataset of 115 manually annotated failed trajectories to help the community build more transparent, resilient agentic systems.

The challenge: Why AI agents are hard to debug

Modern AI agents are often:

  • Long-horizon: They perform dozens of actions over extended periods.
  • Probabilistic: The same input might lead to different outputs, making reproduction difficult.
  • Multi-agent: Failures can be “passed” between agents, masking the original root cause.

Traditional success metrics (like “Did the task finish?”) don’t tell us enough. To build safe agents, we need to identify the exact moment a trajectory becomes unrecoverable and capture evidence for what went wrong at that step.

Introducing AgentRx: An automated diagnostic “prescription”

AgentRx (short for “Agent Diagnosis”) treats agent execution like a system trace that needs validation. Instead of relying on a single LLM to “guess” the error, AgentRx uses a structured, multi-stage pipeline:

  1. Trajectory normalization: Heterogeneous logs from different domains are converted into a common intermediate representation.
  2. Constraint synthesis: The framework automatically generates executable constraints based on tool schemas (e.g., “The API must return a valid JSON response”) and domain policies (e.g., “Do not delete data without user confirmation”).
  3. Guarded evaluation: AgentRx evaluates constraints step-by-step, checking each constraint only when its guard condition applies, and produces an auditable validation log of evidence-backed violations.
  4. LLM-based judging: Finally, an LLM judge uses the validation log and a grounded failure taxonomy to identify the Critical Failure Step—the first unrecoverable error.
The AgentRx workflow: Given a failed trajectory, tool schemas, and domain policy, AgentRx synthesizes guarded constraints, evaluates them step-by-step to produce an auditable violation log with evidence, and uses an LLM judge to predict the critical failure step and root-cause category.
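
To make the guarded-constraint idea concrete, here is a minimal sketch of what step-by-step validation can look like. The class and field names below are illustrative assumptions, not the actual AgentRx interfaces; see the open-source repository for the real implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    guard: Callable[[dict], bool]    # when does this constraint apply?
    check: Callable[[dict], bool]    # does the step satisfy it?
    evidence: Callable[[dict], str]  # what to log if it is violated

# Hypothetical constraint derived from a domain policy such as
# "Do not delete data without user confirmation."
constraints = [
    Constraint(
        name="delete_requires_confirmation",
        guard=lambda step: step["tool"] == "delete_record",
        check=lambda step: step.get("user_confirmed", False),
        evidence=lambda step: f"delete_record called without confirmation: {step['args']}",
    ),
]

def validate(trajectory: list[dict]) -> list[dict]:
    """Walk the trajectory step by step and collect evidence-backed violations."""
    violations = []
    for i, step in enumerate(trajectory):
        for c in constraints:
            if c.guard(step) and not c.check(step):
                violations.append({"step": i, "constraint": c.name, "evidence": c.evidence(step)})
    return violations

trajectory = [
    {"tool": "lookup_record", "args": {"id": 42}},
    {"tool": "delete_record", "args": {"id": 42}},  # no confirmation: flagged as a violation
]
print(validate(trajectory))
```

The resulting violation log is the kind of evidence an LLM judge can then use to decide which violation marks the critical failure step.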

A new benchmark for agent failures

To evaluate AgentRx, we developed a manually annotated benchmark consisting of 115 failed trajectories across three complex domains:

  • τ-bench: Structured API workflows for retail and service tasks.
  • Flash: Real-world incident management and system troubleshooting.
  • Magentic-One: Open-ended web and file tasks using a generalist multi-agent system.

Using a grounded-theory approach, we derived a nine-category failure taxonomy that generalizes across these domains. This taxonomy helps developers distinguish between a “Plan Adherence Failure” (where the agent ignored its own steps) and an “Invention of New Information” (hallucination).

  • Plan Adherence Failure: Ignored required steps or performed extra unplanned actions.
  • Invention of New Information: Altered or invented facts not grounded in the trace or tool output.
  • Invalid Invocation: Tool call malformed, missing arguments, or schema-invalid.
  • Misinterpretation of Tool Output: Read tool output incorrectly and acted on wrong assumptions.
  • Intent–Plan Misalignment: Misread the user’s goal or constraints and planned incorrectly.
  • Under-specified User Intent: Could not proceed because required information wasn’t available.
  • Intent Not Supported: No available tool can do what is being asked.
  • Guardrails Triggered: Execution blocked by safety or access restrictions.
  • System Failure: Connectivity or tool endpoint failures.
Analysis of failure density across domains. In multi-agent systems like Magentic-One, trajectories often contain multiple errors, but AgentRx focuses on identifying the first critical breach.

Key results

In our experiments, AgentRx demonstrated significant improvements over existing LLM-based prompting baselines:

  • +23.6% absolute improvement in failure localization accuracy.
  • +22.9% improvement in root-cause attribution.

By providing the “why” behind a failure through an auditable log, AgentRx allows developers to move beyond trial-and-error prompting and toward systematic agentic engineering.

Join the community: Open-source release

We believe that agent reliability is a prerequisite for real-world deployment. To support this, we are open sourcing the AgentRx framework and the complete annotated benchmark.

We invite researchers and developers to use AgentRx to diagnose their own agentic workflows and contribute to the growing library of failure constraints. Together, we can build AI agents that are not just powerful but also auditable and reliable.

Acknowledgements

We would like to thank Avaljot Singh and Suman Nath for contributing to this project.

The post Systematic debugging for AI agents: Introducing the AgentRx framework appeared first on Microsoft Research.

]]>
PlugMem: Transforming raw agent interactions into reusable knowledge http://approjects.co.za/?big=en-us/research/blog/from-raw-interaction-to-reusable-knowledge-rethinking-memory-for-ai-agents/ Tue, 10 Mar 2026 16:00:41 +0000 It seems counterintuitive: giving AI agents more memory can make them less effective. As interaction logs accumulate, they grow large, fill with irrelevant content, and become increasingly difficult to use. More memory means that agents must search through larger volumes of past interactions to find information relevant to the current task. Without structure, these records mix […]

The post PlugMem: Transforming raw agent interactions into reusable knowledge appeared first on Microsoft Research.

]]>
blue and purple gradient background with decorative white icons

At a glance

  • Today’s AI agents store long interaction histories but struggle to reuse them effectively.
  • Raw memory retrieval can overwhelm agents with lengthy, low-value context.
  • PlugMem transforms interaction history into structured, reusable knowledge.
  • A single, general-purpose memory module improves performance across diverse agent benchmarks while using fewer memory tokens.

It seems counterintuitive: giving AI agents more memory can make them less effective. As interaction logs accumulate, they grow large, fill with irrelevant content, and become increasingly difficult to use.

More memory means that agents must search through larger volumes of past interactions to find information relevant to the current task. Without structure, these records mix useful experiences with irrelevant details, making retrieval slower and less reliable. The challenge is not storing more experiences, but organizing them so that agents can quickly identify what matters in the moment.

In our recent paper “PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents,” we introduce a plug-and-play memory system that transforms raw agent interactions into reusable knowledge. Rather than treating memory as text to retrieve, PlugMem organizes that history into structured knowledge designed to support decisions as the agent acts.

Cognitive science offers a useful framework here. It distinguishes between remembering events, knowing facts, and knowing how to perform tasks. Past events provide context, but effective decisions rely on the facts and skills extracted from those events.

This distinction motivated a shift in how we decided to design memory for AI agents. PlugMem implements this shift by converting the agent’s interaction history, such as dialogues, documents, and web sessions, into structured, compact knowledge units that can be reused across tasks.

How PlugMem works

A key difference between PlugMem and conventional AI memory systems is what gets stored. Traditional approaches store text chunks or named entities (references to people, places, and concepts). PlugMem uses facts and reusable skills as the fundamental building blocks of memory. This design reduces redundancy, increases information density, and improves retrieval precision. It’s built around three core components:

Structure. Raw interactions are standardized and transformed into propositional knowledge (facts) and prescriptive knowledge (reusable skills). These knowledge units are organized into a structured memory graph, enabling knowledge to be stored in a form designed for reuse.

Retrieval. Rather than retrieving long passages of text, PlugMem retrieves knowledge units that are aligned with the current task. High-level concepts and inferred intents serve as routing signals, surfacing the most relevant information for the decision at hand.

Reasoning. Retrieved knowledge is distilled into concise, task-ready guidance before being passed to the base agent, ensuring that only decision-relevant knowledge enters the agent’s context window.

Figure 1 illustrates how these components work together.

Figure 1. PlugMem organizes different types of agent interactions into a knowledge-centric memory graph, enabling structured retrieval and reasoning.
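
To illustrate the difference between retrieving raw logs and retrieving knowledge units, here is a toy sketch of fact and skill units with concept-based routing. The names and fields are illustrative assumptions, not PlugMem’s actual schema; the GitHub repository contains the real implementation.

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeUnit:
    kind: str          # "fact" (propositional) or "skill" (prescriptive)
    content: str       # compact statement of the fact or procedure
    concepts: set[str] = field(default_factory=set)  # routing signals used at retrieval time

memory = [
    KnowledgeUnit("fact", "The staging database is read-only after 6 pm UTC.",
                  {"staging", "database", "deployment"}),
    KnowledgeUnit("skill", "To retry a failed upload, re-authenticate and resend with the same idempotency key.",
                  {"upload", "retry", "authentication"}),
]

def retrieve(task_concepts: set[str], k: int = 3) -> list[KnowledgeUnit]:
    """Return the units whose concepts overlap most with the current task."""
    scored = sorted(memory, key=lambda u: len(u.concepts & task_concepts), reverse=True)
    return [u for u in scored[:k] if u.concepts & task_concepts]

for unit in retrieve({"upload", "retry"}):
    print(f"[{unit.kind}] {unit.content}")
```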

One memory, any task

Most AI memory systems are built for one job. A conversational memory module is designed around dialogue. A knowledge-retrieval system is tuned to look up facts. A web agent’s memory is optimized for navigating pages. Each performs well in its target setting but rarely transfers without significant redesign.

PlugMem takes a different approach. It is a foundational memory layer that can be attached to any AI agent without needing to modify it for a specific task.

Evaluating PlugMem

To test PlugMem, we evaluated the same memory module on three benchmarks that each make different demands on memory:

  • Answering questions across long multi-turn conversations
  • Finding facts that span multiple Wikipedia articles
  • Making decisions while browsing the web

Across all three, PlugMem consistently outperformed both generic retrieval methods and task-specific memory designs while allowing the AI agent to use significantly less memory token budget in the process.

Measuring memory by utility, not size

We wanted to evaluate whether the right information was reaching the agent at the right moment, without overwhelming the model’s context window, which has limited capacity. To do this, we introduced a metric that measures how much useful, decision-relevant information a memory module contributes relative to how much context it consumes.

When we plotted utility against context consumption, PlugMem consistently came out ahead: it delivered more decision-relevant information while consuming less of the AI agent’s context than other approaches, as shown in Figure 2. These results suggest that transforming experience into knowledge—rather than storing and retrieving raw logs—produces memory that is more useful and efficient.

Figure 2. Across all three benchmarks, PlugMem delivered more useful memory with less of the agent’s context window.
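
As a rough illustration of the quantity being measured, consider the sketch below. The metric in the paper may be defined differently; this only conveys the intuition of decision-relevant content surfaced per unit of context consumed.

```python
def memory_utility(relevant_tokens_used: int, memory_tokens_injected: int) -> float:
    """Fraction of injected memory context that was actually decision-relevant (illustrative)."""
    if memory_tokens_injected == 0:
        return 0.0
    return relevant_tokens_used / memory_tokens_injected

# Hypothetical numbers: a raw-log retriever vs. a structured memory module.
print(memory_utility(relevant_tokens_used=120, memory_tokens_injected=4000))  # ~0.03
print(memory_utility(relevant_tokens_used=110, memory_tokens_injected=600))   # ~0.18
```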

Why general-purpose memory can outperform task-specific designs

General-purpose memory modules can outperform systems tailored to specific tasks because the decisive factor is not specialization but whether memory can surface the right knowledge precisely when the agent needs it. Structure, retrieval, and reasoning each play a distinct role, and getting all three right matters more than optimizing for a single use case.

PlugMem is not meant to replace task-specific approaches. It provides a general memory foundation upon which task adaptations can be layered. Our experiments show that combining PlugMem with task-specific techniques yields further gains.

Toward reusable memory for agents

As AI agents take on longer and more complex tasks, their memory needs to evolve from storing past interactions to actively supplying reusable knowledge. The goal is for agents to carry useful facts and strategies from one task to the next rather than starting from scratch each time.

PlugMem represents a step in that direction, grounding memory design in cognitive principles and treating knowledge as the primary unit of reuse. As agent capabilities expand, knowledge-centric memory may prove to be a critical building block for the next generation of intelligent agents.

Code and experimental results are publicly available on GitHub (opens in new tab) so that others can reproduce the results and conduct their own research.

The post PlugMem: Transforming raw agent interactions into reusable knowledge appeared first on Microsoft Research.

]]>
Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model http://approjects.co.za/?big=en-us/research/blog/phi-4-reasoning-vision-and-the-lessons-of-training-a-multimodal-reasoning-model/ Wed, 04 Mar 2026 18:05:57 +0000 http://approjects.co.za/?big=en-us/research/?p=1163159 We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open‑weight multimodal reasoning model, available through Microsoft Foundry (opens in new tab), HuggingFace (opens in new tab) and GitHub (opens in new tab). Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as image captioning, asking […]

The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.

]]>
White line icons against a blue-green gradient background form an architecture flow chart. In the middle of the chart is a three-by-three matrix of circles and lines within a round-edge square. Above the matrix, three icons in a row – an equation, a person using a desktop, and a head with gears flow by dotted lines to the matrix. To the left of the matrix is an icon representing a stack of files with an arrow pointing to the matrix. To the right of the matrix is a graph with a double headed arrow pointing to the matrix and to itself. Below the matrix is an icon representing a document. A dotted line arrow connects this graph to the matrix, showing the direction flowing from the matrix to the document. To the right of the document icon is an hourglass icon and three list icons with a dotted line connecting the hourglass to the lists.

At a glance

  • Phi-4-reasoning-vision-15B is a compact, capable open‑weight multimodal reasoning model that balances reasoning power, efficiency, and training data needs. It supports natural interaction across a wide array of vision-language tasks and excels at math and science reasoning and at understanding user interfaces.
  • We share lessons learned and best practices for training a multimodal reasoning model, showing the value of careful architecture choices, rigorous data curation, and a mixture of reasoning and non-reasoning data.

We are pleased to announce Phi-4-reasoning-vision-15B, a 15 billion parameter open‑weight multimodal reasoning model, available through Microsoft Foundry (opens in new tab), HuggingFace (opens in new tab) and GitHub (opens in new tab). Phi-4-reasoning-vision-15B is a broadly capable model that can be used for a wide array of vision-language tasks such as image captioning, answering questions about images, reading documents and receipts, helping with homework, reasoning about changes across sequences of images, and much more. Beyond these general capabilities, it excels at math and science reasoning and at understanding and grounding elements on computer and mobile screens. In particular, the model offers appealing value relative to popular open-weight models, pushing the Pareto frontier of the tradeoff between accuracy and compute costs. It achieves performance competitive with much slower models that require ten times or more compute time and tokens, and better accuracy than similarly fast models, particularly on math and science reasoning.

Performance charts comparing Phi-4-Reasoning-Vision-15B against other models (Kimi-VL, Qwen-3, Gemma-3) on accuracy vs. response time and accuracy vs. completion tokens. Phi-4 stands out as being fast and token-efficient while achieving ~75% accuracy.
Figure 1: Phi-4-reasoning-vision-15B presents a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute costs. It achieves performance competitive with much slower models that require more time and tokens, and higher accuracy than similarly fast models. These values were computed by averaging accuracy, time, and output token counts for a subset of 4 benchmarks: ChartQA_TEST, MathVista_MINI, MMMU_VAL, and ScreenSpot_v2, where we had logged these values.

In this post, we share the motivations, design choices, experiments, and learnings that informed its development, as well as an evaluation of the model’s performance and guidance on how to use it. Our goal is to contribute practical insight to the community on building smaller, efficient multimodal reasoning models and to share an open-weight model that is competitive with models of similar size at general vision-language tasks, excels at computer use, and excels on scientific and mathematical multimodal reasoning.

A focus on smaller and faster vision–language models

Many popular vision-language models (VLMs) have trended toward growing in parameter count and, in particular, in the number of tokens they consume and generate. This increases training and inference-time cost and latency and impedes their usability for downstream deployment, especially in resource‑constrained or interactive settings.

A growing countertrend towards smaller (opens in new tab) models aims to boost efficiency, enabled by careful model design and data curation – a goal pioneered by the Phi family of models (opens in new tab) and furthered by Phi-4-reasoning-vision-15B. We specifically build on learnings from the Phi-4 and Phi-4-Reasoning language models and show how a multimodal model can be trained to cover a wide range of vision and language tasks without relying on extremely large training datasets, architectures, or excessive inference‑time token generation. Our model is intended to be lightweight enough to run on modest hardware while remaining capable of structured reasoning when it is beneficial. It was trained with far less compute than many recent open-weight VLMs of similar size: we used just 200 billion tokens of multimodal data, building on Phi-4-reasoning (trained with 16 billion tokens), itself based on the core model Phi-4 (400 billion unique tokens), compared to more than 1 trillion tokens used for training multimodal models like Qwen 2.5 VL (opens in new tab) and 3 VL (opens in new tab), Kimi-VL (opens in new tab), and Gemma3 (opens in new tab). We can therefore offer a compelling option compared to existing models, pushing the Pareto frontier of the tradeoff between accuracy and compute costs.

 A travel blog caption task. Given a photo of Iguazu Falls, the model writes a personal, evocative caption referencing the rainbow, the mist, and the emotional experience.
Restaurant bill splitting. Given a photo of a receipt and instructions about who ordered what, the model calculates each person's share including half the tax, and returns the result as JSON.
Laundry care symbol interpretation. The model correctly identifies all five symbols: machine washable, do not bleach, tumble dry low, iron on low heat, do not dry clean.
Figure 2: Phi-4-Reasoning-Vision can help with a wide range of everyday tasks.

Lessons from training a multimodal model

Training a multimodal reasoning model raises numerous questions and requires many nuanced design choices around model architecture, dataset quality and composition, and the interaction between reasoning‑heavy and non-reasoning perception‑focused tasks.

Model architecture: Early- vs mid-fusion

Model architectures for VLMs differ primarily in how visual and textual information is fused. Mid-fusion models use a pretrained vision encoder to convert images into visual tokens that are projected into a pretrained LLM’s embedding space, enabling cross-modal reasoning while leveraging components already trained on trillions of tokens. Early-fusion models process image patches and text tokens in a single transformer, yielding richer joint representations but at significantly higher compute, memory, and data cost. We adopted a mid-fusion architecture, as it offers a practical trade-off for building a performant model with modest resources.
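
As a rough sketch of the mid-fusion pattern described above, the following PyTorch module projects features from a frozen vision encoder into a language model's embedding space and prepends them to the text embeddings. The module name and dimensions are placeholders, not the Phi-4-reasoning-vision-15B implementation.

```python
import torch
import torch.nn as nn

class MidFusionConnector(nn.Module):
    """Illustrative mid-fusion glue: project vision-encoder features into the
    LLM embedding space and concatenate them with text embeddings.
    Dimensions are placeholders, not the actual model's."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_visual_tokens, vision_dim) from a frozen encoder
        # text_embeds:  (batch, num_text_tokens, llm_dim) from the LLM's embedding table
        visual_tokens = self.proj(vision_feats)
        # Prepend visual tokens so the LLM attends to them alongside the text.
        return torch.cat([visual_tokens, text_embeds], dim=1)

# Example shapes: 729 visual tokens, 32 text tokens.
fused = MidFusionConnector()(torch.randn(1, 729, 1152), torch.randn(1, 32, 5120))
print(fused.shape)  # torch.Size([1, 761, 5120])
```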

Model architecture: Vision encoder and image processing

We build on the SigLIP-2 (opens in new tab) vision encoder and the Phi-4-Reasoning backbone. In previous research, we found that multimodal language models sometimes struggled to solve tasks not because of a lack of reasoning proficiency, but rather because of an inability to extract and select relevant perceptual information from the image. An example would be a high-resolution screenshot that is information-dense with relatively small interactive elements.

Several open-source multimodal language models have adapted their methodologies accordingly, e.g., Gemma3 (opens in new tab) uses pan-and-scan and NVILA (opens in new tab) uses Dynamic S2. However, their trade-offs are difficult to understand across different datasets and hyperparameters. To this end, we conducted an ablation study of several techniques. We trained a smaller 5 billion parameter Phi-4 based proxy model on a dataset of 10 million image-text pairs, primarily composed of computer-use and GUI grounding data. We compared with Dynamic S2, which resizes images to a rectangular resolution that minimizes distortion while admitting a tiling by 384×384 squares; Multi-crop, which splits the image into potentially overlapping 384×384 squares and concatenates their encoded features on the token dimension; Multi-crop with S2, which broadens the receptive field by cropping into 1536×1536 squares before applying S2; and Dynamic resolution using the Naflex variant of SigLIP-2, a natively dynamic-resolution encoder with adjustable patch counts.

Our primary finding is that dynamic resolution vision encoders perform the best and especially well on high-resolution data. It is particularly interesting to compare dynamic resolution with 2048 vs 3600 maximum tokens: the latter roughly corresponds to native HD 720p resolution and enjoys a substantial boost on high-resolution benchmarks, particularly ScreenSpot-Pro. Reinforcing the high-resolution trend, we find that multi-crop with S2 outperforms standard multi-crop despite using fewer visual tokens (i.e., fewer crops overall). The dynamic resolution technique produces the most tokens on average; due to their tiling subroutine, S2-based methods are constrained by the original image resolution and often only use about half the maximum tokens. From these experiments we choose the SigLIP-2 Naflex variant as our vision encoder.

Method | Max Tokens | MathVista | ScreenSpot | ScreenSpot-Pro | V*Bench
Dynamic-S2 | 3096 | 42.9 | 78.4 | 9.4 | 52.9
Multi-crop | 3096 | 43.4 | 67.8 | 5.4 | 51.8
Multi-crop with S2 | 2048 | 43.4 | 79.1 | 10.6 | 57.1
Dynamic resolution | 2048 | 45.2 | 81.5 | 9.2 | 51.3
Dynamic resolution | 3600 | 44.9 | 79.7 | 17.5 | 56.0
Table 1: Results with different resolution handling approaches. The top two configurations on each benchmark are in bold.
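
To make the token budgets in Table 1 concrete, here is a small sketch of how a dynamic-resolution encoder might choose a patch grid under a maximum-token budget. The 16-pixel patch size is an assumption; it is consistent with the note above, since a 1280×720 image then yields exactly 3,600 patches.

```python
import math

def dynamic_patch_grid(width: int, height: int, patch: int = 16,
                       max_tokens: int = 3600) -> tuple:
    """Pick a patch grid (cols, rows) that fits within max_tokens while roughly
    preserving aspect ratio. The 16-pixel patch size is an assumption, not a spec."""
    cols, rows = width // patch, height // patch
    if cols * rows <= max_tokens:
        return cols, rows
    scale = math.sqrt(max_tokens / (cols * rows))   # shrink both sides equally
    return max(1, int(cols * scale)), max(1, int(rows * scale))

for budget in (2048, 3600):
    grid = dynamic_patch_grid(1280, 720, max_tokens=budget)
    print(budget, grid, grid[0] * grid[1])
# 2048 -> roughly a 60x33 grid (~1,980 tokens); 3600 -> the full 80x45 grid (3,600 tokens)
```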

Data: Quality and composition

As with its language backbone Phi-4-Reasoning, Phi-4-reasoning-vision-15B was trained with a deliberate focus on data quality. Our final dataset consists primarily of data from three sources: open-source datasets which were meticulously filtered and improved; high-quality domain-specific internal data; and high-quality data from targeted acquisitions. The overwhelming majority of our data lies in the first category: data which originated as open-source data, which were significantly filtered and improved, whether by removing low-quality datasets or records, programmatically fixing errors in data formatting, or using open-source images as seeds to synthetically generate higher-quality accompanying text.

The process of improving open-source data began by manually reviewing samples from each dataset. Typically, 5 to 10 minutes were sufficient to classify data as excellent-quality, good questions with wrong answers, low-quality questions or images, or high-quality with formatting errors. Excellent data was kept largely unchanged. For data with incorrect answers or poor-quality captions, we re-generated responses using GPT-4o and o4-mini, excluding datasets where error rates remained too high. Low-quality questions proved difficult to salvage, but when the images themselves were high quality, we repurposed them as seeds for new caption or visual question answering (VQA) data. Datasets with fundamentally flawed images were excluded entirely. We also fixed a surprisingly large number of formatting and logical errors across widely used open-source datasets.

We extracted additional value from existing datasets through reformatting, diversification, and using images as seeds for new data generation. We generated detailed image descriptions alongside original QA pairs for math and science data, had data perform “double-duty” by embedding instruction-following requirements directly into domain-specific QA, created “scrambled,” “caption-matching,” and “what’s changed?” records to improve multi-image reasoning and sequential navigation for CUA scenarios, and diversified prompt styles to encourage robustness beyond perfectly structured questions.

To supplement the improved open-source data, we utilize high-quality internal datasets, several math-specific datasets which were acquired during training of the Phi-4 language model, and also some domain-specific curated data; for example, latex-OCR data generated by processing and rendering equations from arXiv documents.

Figure 3: Phi-4-reasoning-vision-15B training data composition and examples

Data: Mathematics vs. computer-use data proportion

One of our goals was to train a model that performs well across general vision-language tasks, while excelling at mathematical and scientific reasoning and computer-use scenarios. How to structure datasets for generalizable reasoning remains an open question—particularly because the relationship between data scale and reasoning performance can lead to starkly different design decisions, such as training a single model on a large dataset versus multiple specialized models with targeted post-training.

Research on long-tailed classification robustness has suggested that balancing or removing data from overrepresented tasks or subgroups (opens in new tab) is an effective method for ensuring good performance. Nevertheless, these insights are not fully utilized or explored when it comes to training VLMs, which at times have favored scale over careful data balancing. To achieve our goals, we conducted a set of experiments to analyze a range of data ratios between our focus domains.

Using the same 5 billion parameter proxy model as for previous experiments, we trained while varying the amount of mathematics and science vs. computer-use data for each run. Each dataset included the same subset of 1 million general image-text pairs as a baseline. For mathematics and science data, we used a subsample of 150,000 records, optionally duplicating each one up to three times. Next, we included up to 450,000 computer-use records, and optionally an additional 400,000 from Phi-Ground.

We found that multimodal mathematics and science performance was not harmed by additional computer-use data, and vice versa. Interestingly, we found that increasing mathematics data by 3x while keeping computer-use data constant improved math, science, and computer-use benchmarks.

General | Math and Science | CUA | Total | MMMU | MathVista | ScreenSpot-V2
1M | 150K | 450K | 1.6M | 44.0 | 37.4 | 48.2
1M | 150K | 850K | 2.0M | 44.1 | 37.3 | 60.0
1M | 450K | 450K | 1.9M | 45.3 | 36.0 | 48.3
1M | 450K | 850K | 2.3M | 43.4 | 38.9 | 63.1
1M | 150K | 150K | 1.3M | 44.2 | 36.9 | 29.8
1M | 150K | 250K | 1.4M | 45.4 | 37.4 | 37.7
Table 2: Varying the ratios of math and CUA data. Increasing math data by 3x while keeping computer-use data constant improves both math and computer-use benchmarks.
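
As a back-of-the-envelope illustration of the mixture sizes explored above, the snippet below simply enumerates record counts; it is not the actual data pipeline, and the option lists are read loosely off Table 2.

```python
import itertools

# Record counts only; an illustrative enumeration of the ablation grid above,
# not the actual dataset construction pipeline.
GENERAL = 1_000_000                                   # shared general image-text pairs
MATH_SCIENCE = (150_000, 450_000)                     # 150K base, optionally duplicated up to 3x
COMPUTER_USE = (150_000, 250_000, 450_000, 850_000)   # with/without additional grounding records

for math_n, cua_n in itertools.product(MATH_SCIENCE, COMPUTER_USE):
    total = GENERAL + math_n + cua_n
    print(f"math+science={math_n:>7,}  computer-use={cua_n:>7,}  total={total / 1e6:.1f}M")
```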

Data: Synthetic data for text-rich visual reasoning

Recent work (opens in new tab) suggests that targeted synthetic data can materially improve multimodal reasoning, particularly for text-rich visual domains such as charts, documents, diagrams, and rendered mathematics. Using images, questions, and answers that are programmatically generated and grounded in the visual structure enables precise control over visual content and supervision quality, resulting in data that avoids many annotation errors, ambiguities, and distributional biases common in scraped datasets. This enables cleaner alignment between visual perception and multi-step inference, which has been shown to translate into measurable gains on reasoning-heavy benchmarks.

Synthetic text-rich images expand coverage of long-tail visual formats that are underrepresented in real data but disproportionately impact reasoning accuracy, improving not only visual grounding but also downstream reasoning by ensuring that failures are less often caused by perceptual errors. We found that programmatically generated synthetic data is a useful augmentation to high-quality real datasets — not a replacement, but a scalable mechanism for strengthening both perception and reasoning that complements the training objectives in compact multimodal models such as Phi-4-reasoning-vision-15B.

Mixing non-reasoning and reasoning as a design objective

In language-only settings, reasoning traces have improved performance on many tasks, but they require additional compute, which adds undesired latency. In multimodal settings, this tradeoff is less clear-cut: for tasks such as image captioning and optical character recognition (OCR), reasoning is often unnecessary and can even be harmful (opens in new tab), while mathematical and scientific problem-solving benefits from multi-step reasoning. Thus, the choice of when to reason or not can be quite nuanced.

Training approaches for multimodal reasoning models

Language-only reasoning models are typically created through supervised fine-tuning (SFT) or reinforcement learning (RL): SFT is simpler but requires large amounts of expensive reasoning trace data, while RL reduces data requirements at the cost of significantly increased training complexity and compute. Multimodal reasoning models follow a similar process, but the design space is more complex. With a mid-fusion architecture, the first decision is whether the base language model is itself a reasoning or non-reasoning model. This leads to several possible training pipelines:

  • Non-reasoning LLM → reasoning multimodal training: Reasoning and multimodal capabilities are trained together.
  • Non-reasoning LLM → non-reasoning multimodal → reasoning multimodal training: Multimodal capabilities are learned first, then reasoning is added.
  • Reasoning LLM → reasoning multimodal training: A reasoning base is used, but all multimodal data must include reasoning traces.
  • Our approach: Reasoning LLM → mixed non-reasoning / reasoning multimodal training. A reasoning-capable base is trained on a hybrid data mixture, learning when to reason and when to respond directly.

Approaches 1 and 2 offer flexibility in designing multimodal reasoning behavior from scratch using widely available non-reasoning LLM checkpoints but place a heavy burden on multimodal training. Approach 1 must teach visual understanding and reasoning simultaneously and requires a large amount of multimodal reasoning data, while Approach 2 can be trained with less reasoning data but risks catastrophic forgetting, as reasoning training may degrade previously learned visual capabilities. Both risk weaker reasoning than starting from a reasoning-capable base. Approach 3 inherits strong reasoning foundations, but like Approach 1, it requires reasoning traces for all training data and produces reasoning traces for all queries, even when not beneficial.

Our approach: A mixed reasoning and non-reasoning model

Phi-4-reasoning-vision-15B adopts the 4th approach listed previously, as it balances reasoning capability, inference efficiency, and data requirements. It inherits a strong reasoning foundation but uses a hybrid approach to combine the strengths of alternatives while mitigating their drawbacks. Our model defaults to direct inference for perception-focused domains where reasoning adds latency without improving accuracy, avoiding unnecessary verbosity and reducing inference costs, and it invokes longer reasoning paths for domains, such as math and science, that benefit from structured multi-step reasoning (opens in new tab).

Our model is trained with SFT, where reasoning samples include explicit thinking sections with chain-of-thought reasoning before the final answer, covering domains like math and science. Non-reasoning samples are tagged with a special token that signals a direct response and cover perception-focused tasks such as captioning, grounding, OCR, and simple VQA. Reasoning data comprises approximately 20% of the total mix. Starting from a reasoning-capable backbone means this data grounds existing reasoning in visual contexts rather than teaching the model to reason from scratch.
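
As a schematic of how such a mixed supervised fine-tuning set could be assembled, the sketch below formats reasoning and non-reasoning samples and draws a roughly 20/80 mixture. The control-token strings are placeholders standing in for the model's actual special tokens, which this post does not spell out.

```python
import random

# Placeholder control tokens: the model's real special tokens are not spelled out
# in this post, so these strings are purely illustrative.
THINK_OPEN, THINK_CLOSE, NO_THINK = "<think>", "</think>", "<no_think>"

def format_reasoning_sample(question: str, chain_of_thought: str, answer: str) -> str:
    """Math/science-style sample: an explicit thinking section precedes the answer."""
    return f"{question}\n{THINK_OPEN}\n{chain_of_thought}\n{THINK_CLOSE}\n{answer}"

def format_direct_sample(question: str, answer: str) -> str:
    """Perception-style sample (captioning, grounding, OCR): respond directly."""
    return f"{question}\n{NO_THINK}\n{answer}"

def sample_mixture(reasoning: list, direct: list, n: int,
                   reasoning_fraction: float = 0.2, seed: int = 0) -> list:
    """Draw a mixture that is roughly 20% reasoning data, matching the split above."""
    rng = random.Random(seed)
    return [rng.choice(reasoning if rng.random() < reasoning_fraction else direct)
            for _ in range(n)]
```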

This approach is not without limitations. The balance between modes is a direct function of design choices we made, informed by recent literature (opens in new tab) and observed model behavior during training—though the boundary between modes can be imprecise as it is learned implicitly from the data distribution. Our model allows control through explicit prompting with its reasoning and non-reasoning control tokens when the user wants to override the default behavior. The 20/80 reasoning-to-non-reasoning data split may not be optimal for all domains or deployment contexts. Evaluating the ideal balance of data and the model’s ability to switch appropriately between modes remains an open problem.

We view this mixed approach not as a definitive solution, but as one practical and well-motivated point in the design space for balancing latency, accuracy, and flexibility in multimodal systems.

Applications

A multi-image reasoning example — five Hubble photos of Saturn from 2018–2022, with the query
Figure 4: Phi-4-Reasoning-Vision can interpret sequences of images 

Phi-4-reasoning-vision-15B is a high-performing model across many vision-language tasks. It sees and understands the world by looking at a photo, document, chart, or screen and making sense of it. In practice that covers an enormous range of applications — just a few examples include describing images and answering questions about them, interpreting changes and trends in image sequences, recognizing objects and landmarks, and transcribing text.

Highlights: Scientific and mathematical reasoning and supporting computer-using agents (CUA)

In addition to general vision and language tasks, Phi-4-reasoning-vision-15B was designed to excel at tasks that combine visual input with structured inference: solving math problems presented in visual form, such as handwritten or diagram-based questions; extracting and reasoning over quantitative information in documents and charts; and supporting multi-step reasoning in educational or scientific analysis contexts.

A physics problem about spring-mass systems, with two diagrams. The model correctly works through the spring constant relationships and arrives at answer B (0.433s).
Figure 5: Phi-4-reasoning-vision-15B is great at math and science 
A handwritten math homework checker. The student made a sign error in the quadratic formula (wrote −8 instead of +8). The model's thinking process catches the error and provides the corrected solution (x = 5 and x = 3).
Figure 6: Phi-4-reasoning-vision-15B can help with written math problems 

In addition, we trained Phi-4-reasoning-vision-15B to have skills that can enable agents to interact with graphical user interfaces by interpreting screen content and selecting actions. With strong high-resolution perception and fine-grained grounding capabilities, Phi-4-reasoning-vision-15B is a compelling option as a base model for training agentic models, such as ones that navigate desktop, web, and mobile interfaces by identifying and localizing interactive elements such as buttons, menus, and text fields. Due to its low inference-time compute needs, it is well suited to interactive environments where low latency and compact model size are essential.

A GUI interaction task. Given a Windows 11 Start Menu screenshot and the query
A Google Shopping screenshot of heels. The model identifies all black heels, provides bounding box coordinates for each, and suggests outfit pairings (little black dress, tailored suit, jumpsuit).
Figure 7: Phi-4-reasoning-vision-15B can help navigate computer UIs

Evaluation

Phi-4-reasoning-vision-15B was evaluated for accuracy and timing using two complementary open-source frameworks to ensure both rigorous and standardized analysis: Eureka ML Insights (opens in new tab) and VLMEvalKit (opens in new tab).

Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force nothink | Phi-4-mm-instruct | Kimi-VL-A3B-Instruct | gemma-3-12b-it | Qwen3-VL-8B-Instruct-4K | Qwen3-VL-8B-Instruct-32K | Qwen3-VL-32B-Instruct-4K | Qwen3-VL-32B-Instruct-32K
AI2D_TEST | 84.8 | 84.7 | 68.6 | 84.6 | 80.4 | 82.7 | 83 | 84.8 | 85
ChartQA_TEST | 83.3 | 76.5 | 23.5 | 87 | 39 | 83.1 | 83.2 | 84.3 | 84
HallusionBench | 64.4 | 63.1 | 56 | 65.2 | 65.3 | 73.5 | 74.1 | 74.4 | 74.9
MathVerse_MINI | 44.9 | 43.8 | 32.4 | 41.7 | 29.8 | 54.5 | 57.4 | 64.2 | 64.2
MathVision_MINI | 36.2 | 34.2 | 20 | 28.3 | 31.9 | 45.7 | 50 | 54.3 | 60.5
MathVista_MINI | 75.2 | 68.7 | 50.5 | 67.1 | 57.4 | 77.1 | 76.4 | 82.5 | 81.8
MMMU_VAL | 54.3 | 52 | 42.3 | 52 | 50 | 60.7 | 64.6 | 68.6 | 70.6
MMStar | 64.5 | 63.3 | 45.9 | 60 | 59.4 | 68.9 | 69.9 | 73.7 | 74.3
OCRBench | 76 | 75.6 | 62.6 | 86.5 | 75.3 | 89.2 | 90 | 88.5 | 88.5
ScreenSpot_v2 | 88.2 | 88.3 | 28.5 | 89.8 | 3.5 | 91.5 | 91.5 | 93.7 | 93.9
Table 3: Accuracy comparisons relative to popular open-weight, non-thinking models
Benchmark | Phi-4-reasoning-vision-15B | Phi-4-reasoning-vision-15B – force thinking | Kimi-VL-A3B-Thinking | gemma-3-12b-it | Qwen3-VL-8B-Thinking-4K | Qwen3-VL-8B-Thinking-40K | Qwen3-VL-32B-Thinking-4K | Qwen3-VL-32B-Thinking-40K
AI2D_TEST | 84.8 | 79.7 | 81.2 | 80.4 | 83.5 | 83.9 | 86.9 | 87.2
ChartQA_TEST | 83.3 | 82.9 | 73.3 | 39 | 78 | 78.6 | 78.5 | 79.1
HallusionBench | 64.4 | 63.9 | 70.6 | 65.3 | 71.6 | 73 | 76.4 | 76.6
MathVerse_MINI | 44.9 | 53.1 | 61 | 29.8 | 67.3 | 73.3 | 78.3 | 78.2
MathVision_MINI | 36.2 | 36.2 | 50.3 | 31.9 | 43.1 | 50.7 | 60.9 | 58.6
MathVista_MINI | 75.2 | 74.1 | 78.6 | 57.4 | 77.7 | 79.5 | 83.9 | 83.8
MMMU_VAL | 54.3 | 55 | 60.2 | 50 | 59.3 | 65.3 | 72 | 72.2
MMStar | 64.5 | 63.9 | 69.6 | 59.4 | 69.3 | 72.3 | 75.5 | 75.7
OCRBench | 76 | 73.7 | 79.9 | 75.3 | 81.2 | 82 | 83.7 | 85
ScreenSpot_v2 | 88.2 | 88.1 | 81.8 | 3.5 | 93.3 | 92.7 | 83.1 | 83.1
Table 4: Accuracy comparisons relative to popular open-weight, thinking models

Our model balances thinking and non-thinking performance – on average showing better accuracy in the default “mixed-reasoning” behavior than when forcing thinking or non-thinking. Only in a few cases does forcing a specific mode improve performance (MathVerse_MINI and MMMU_VAL for thinking and ScreenSpot_v2 for non-thinking). Compared to recent popular, open-weight models, our model provides a desirable trade-off between accuracy and cost (as a function of inference-time compute and output tokens), as discussed previously.

Note: All numbers here are the result of running benchmarks ourselves and may be lower than other previously shared numbers. Instead of quoting leaderboards, we performed our own benchmarking, so we could understand scaling performance as a function of output token counts for related models. We made our best effort to run fair evaluations and used recommended evaluation platforms with model-specific recommended settings and prompts provided for all third-party models. For Qwen models we use the recommended token counts and also ran evaluations matching our max output token count of 4096. For Phi-4-reasoning-vision-15B, we used our system prompt and chat template but did not do any custom user-prompting or parameter tuning, and we ran all evaluations with temperature=0.0, greedy decoding, and 4096 max output tokens. These numbers are provided for comparison and analysis rather than as leaderboard claims. For maximum transparency and fairness, we will release all our evaluation logs publicly. For more details on our evaluation methodology, please see our technical report (opens in new tab).
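
For readers who want to reproduce the decoding setup, the sketch below shows how greedy decoding with a 4,096-token output budget maps onto a standard Hugging Face generation call. The repository id and prompt template are assumptions for illustration only; the model card and GitHub examples are the authoritative references for loading code and the chat template.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# The repository id and prompt template below are placeholders for illustration;
# consult the official model card for the actual names and recommended usage.
MODEL_ID = "microsoft/Phi-4-reasoning-vision-15B"  # hypothetical identifier

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("chart.png")
prompt = "<|user|><|image_1|>What trend does this chart show?<|end|><|assistant|>"  # placeholder template
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Greedy decoding with a 4,096-token output budget, matching the settings described above.
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=4096)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```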

Safety

As with other Phi models, Phi-4-reasoning-vision-15B was developed with safety as a core consideration throughout training and evaluation. The model was trained on a mixture of public safety datasets and internally generated examples designed to elicit behaviors the model should appropriately refuse, in alignment with Microsoft’s Responsible AI Principles. For further details, check out our technical report (opens in new tab).

Open release and community engagement

Phi-4-reasoning-vision-15B is available on Microsoft Foundry (opens in new tab) and HuggingFace (opens in new tab) with additional examples and details on GitHub (opens in new tab). For additional guidance on how to use our model properly and safely, please refer to our Model card (opens in new tab). For further details on the technical aspects of the model, training, and evaluation, see our technical report (opens in new tab).

In line with our goal of supporting future AI development in the community, Phi-4-reasoning-vision-15B is released under a permissive license with model weights, fine‑tuning code, and benchmark logs. We intend this release to complement existing work by providing concrete artifacts that help close gaps in understanding how compact multimodal reasoning models can be built and studied.

Looking forward

Smaller vision–language models with selective, task‑aware reasoning offer one promising direction for making multimodal systems more practical and accessible. We present our model and its learnings to inform ongoing research in multimodal modeling, computer‑using agents, and mathematical and scientific reasoning. We hope these details are useful to researchers exploring similar tradeoffs and invite critical evaluation, replication, and extension by the community. If you’d like to join us and help shape the future of multimodal models, please apply for one of our open roles.

Acknowledgements

We thank Rachel Ward for her extensive work on data collection and curation. We thank the GenDatasets, PhiGround, SimCity, and Fara-7B efforts for invaluable training data. We thank Harkirat Behl, Mojan Javaheripi, and Suriya Gunasekar for providing us with Phi-4 checkpoints and guidance on training with Phi models. We additionally thank Sahaj Agarwal, Ahmed Awadallah, Qi Dai, Gustavo de Rosa, Rafah Hosn, Ece Kamar, Piero Kauffmann, Yash Lara, Chong Luo, Caio César Teodoro Mendes, Akshay Nambi, Craig Presti, Matthew Rosoff, Corby Rosset, Marco Rossi, Kashyap Patel, Adil Salim, Sidhartha Sen, Shital Shah, Pratyusha Sharma, Alexey Taymanov, Vibhav Vineet, John Weiss, Spencer Whitehead, the AI Frontiers Team and Leadership, and Microsoft Research Leadership, for their valuable help, insightful discussions, and continued support throughout this work.

The post Phi-4-reasoning-vision and the lessons of training a multimodal reasoning model appeared first on Microsoft Research.

]]>
Trailer: The Shape of Things to Come http://approjects.co.za/?big=en-us/research/podcast/trailer-the-shape-of-things-to-come/ Tue, 03 Mar 2026 13:00:18 +0000 http://approjects.co.za/?big=en-us/research/?p=1162826 Microsoft research lead Doug Burger introduces his new podcast series, "The Shape of Things to Come", an exploration into the fundamental truths about AI and how the technology will reshape the future.

The post Trailer: The Shape of Things to Come appeared first on Microsoft Research.

]]>

Technical advances are moving at such a rapid pace that it can be challenging to define the tomorrow we’re working toward. In The Shape of Things to Come, Microsoft research leader Doug Burger and experts from across disciplines tease out the thorniest AI issues facing technologists, policymakers, business decision-makers, and other stakeholders today. The goal: to amplify the shared understanding needed to build a future in which the AI transition is a net positive. 

Transcript

[MUSIC] 

DOUG BURGER: AI is going to reshape the future. I don’t think there’s any question about that now. How we reshape it depends on the choices we make, and so it’s important to understand what we think those shapes are. 

This is The Shape of Things to Come. I’m Doug Burger. I manage Microsoft Research’s worldwide labs, and I’m excited to introduce this new Microsoft Research Podcast series.  

I called the podcast The Shape of Things to Come because as researchers, the problems that we choose to solve and the technologies that we develop do change the shape of the future.  

It’s very hard to say whether we’re in an inflection point because I see the advancement of technology accelerating. But I don’t know what the inflection point is because all I’ve seen is a curve going up. And so I do think this technology at the rate that it’s accelerating—and I think it will continue to accelerate—it offers tremendous promise and potential for the human race. But there are also dangers, and this technology is coming so fast and advancing so fast, it’s very hard to see where it will go.  

My goal for the series is for, you know, the people that choose to listen to come away more informed about where we think AI is headed, to have some of the myths dispelled, to have a deeper understanding of the stack and what’s on the cutting edge and where we think some of the unsolved problems are, and really thinking about what this explosion in intelligence means for humanity going forward.  

STANDARD OUTRO: Check out this Microsoft Research Podcast series and other episodes of the Microsoft Research Podcast at aka.ms/researchpodcast (opens in new tab) or on YouTube and major podcast platforms.  

[MUSIC FADES] 

The post Trailer: The Shape of Things to Come appeared first on Microsoft Research.

]]>
CORPGEN advances AI agents for real work http://approjects.co.za/?big=en-us/research/blog/corpgen-advances-ai-agents-for-real-work/ Thu, 26 Feb 2026 17:06:34 +0000 http://approjects.co.za/?big=en-us/research/?p=1162836 By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one […]

The post CORPGEN advances AI agents for real work appeared first on Microsoft Research.

]]>
decorative icons in white on a blue and green gradient background

At a glance

  • Today’s AI agent benchmarks test one task at a time, while real workplace productivity requires managing dozens of interdependent tasks at once. To reflect this, we created a setting called Multi-Horizon Task Environments (MHTEs).
  • Under multi-task loads, leading computer-using agents degrade sharply, with completion rates dropping from 16.7% to 8.7%.
  • CORPGEN introduces digital employees, with hierarchical planning, memory isolation, and experiential learning, delivering up to 3.5 times higher completion rates than baselines across three independent agent backends.
  • Because CORPGEN is architecture-agnostic and modular, its gains come from system design rather than any single base model, and it benefits directly as underlying models improve.

By mid-morning, a typical knowledge worker is already juggling a client report, a budget spreadsheet, a slide deck, and an email backlog, all interdependent and all demanding attention at once. For AI agents to be genuinely useful in that environment, they will need to operate the same way, but today’s best models are evaluated one task at a time, not dozens at once.

In our paper, “CORPGEN: Simulating Corporate Environments with Autonomous Digital Employees in Multi-Horizon Task Environments,” we propose an agent framework that equips AI with the memory, planning, and learning capabilities to close that gap.

Introducing Multi-Horizon Task Environments

Replicating the reality of workplace multitasking requires a new kind of evaluation environment. In response, we developed Multi-Horizon Task Environments (MHTEs), settings where an agent must manage multiple complex tasks simultaneously. Each task requires 10 to 30 dependent steps within a single session spanning five hours.

To determine what a benchmark would need to test, we ran MHTEs at scale on some of today’s leading AI agents, exposing four weaknesses. First, memory fills up. An agent cannot hold details for multiple active tasks at once. Second, information from one task interferes with reasoning about another. Third, tasks don’t depend on each other in simple sequences. They form complex webs where an agent must constantly check whether upstream work is finished before it can move forward on anything downstream. Fourth, every action cycle requires reprioritizing across all active tasks, not simply resuming where the agent left off.
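
The third and fourth weaknesses can be pictured with a small sketch: a readiness check over a task dependency graph, followed by reprioritization across everything that is currently actionable. The data structures are illustrative, not the benchmark's actual representation.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    depends_on: set = field(default_factory=set)
    priority: int = 0
    done: bool = False

def actionable(tasks: dict) -> list:
    """A task is actionable only when every upstream dependency is finished."""
    return [t for t in tasks.values()
            if not t.done and all(tasks[d].done for d in t.depends_on)]

def next_task(tasks: dict):
    """Re-prioritize across all active tasks on every cycle, not just resume the last one."""
    ready = actionable(tasks)
    return max(ready, key=lambda t: t.priority) if ready else None

tasks = {
    "budget": Task("budget", priority=2, done=True),
    "report": Task("report", depends_on={"budget"}, priority=3),
    "deck":   Task("deck", depends_on={"report"}, priority=5),
    "email":  Task("email", priority=1),
}
print(next_task(tasks).name)  # "report": highest-priority task whose upstream work is done
```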

We also tested three independent agent systems under increasing loads. As the number of concurrent tasks rose from 12 to 46, completion rates fell from 16.7% to 8.7% across all systems.

CORPGEN’s architecture

CORPGEN introduces digital employees: LLM-powered AI agents with persistent identities, role-specific expertise, and realistic work schedules. They operate Microsoft Office applications through GUI automation and perform consistently within MHTEs over hours of continuous activity. Figure 1 illustrates how a digital employee moves through a full workday.

Diagram showing a digital employee's workday in three phases. Day Init on the left, where the agent loads memory and generates a daily plan. Execution Cycles in the center, where the agent repeatedly retrieves context, reasons and acts through a ReAct loop, and persists results across 50+ interleaved tasks. Day End on the right, where the agent generates a reflection and consolidates experience into long-term memory. Below the diagram, labels show the tiered memory architecture and experiential learning components.
Figure 1. Each day begins with a structured plan and memory loaded from previous sessions. The agent then works through overlapping tasks in repeated cycles, storing key outcomes at day’s end to inform the next session.

CORPGEN addresses each of the four weaknesses of concurrent task execution—memory overload, cross-task interference, dependency complexity, and reprioritization—in a targeted way. Hierarchical planning breaks objectives into daily goals and then into moment-to-moment decisions, allowing the agent to act from a structured plan instead of reviewing all available tasks before each step.

Subagents perform complex operations like web research in isolated contexts, preventing cross-task contamination. A tiered memory system enables selective recall of task-related information rather than retaining everything in active context. Adaptive summarization compresses routine observations while preserving critical information, keeping memory growth controlled.
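
As a compact illustration of the tiered-memory and adaptive-summarization ideas, the sketch below keeps a bounded working memory and spills compressed summaries into a long-term store that supports selective recall. It is a toy stand-in; CORPGEN's actual implementation builds on Mem0 and is considerably more involved.

```python
class TieredMemory:
    """Illustrative tiered memory: bounded working memory plus a searchable
    long-term store. Not CORPGEN's actual implementation."""

    def __init__(self, working_capacity: int = 20):
        self.working = []      # recent, verbatim observations
        self.long_term = []    # compressed summaries of older activity
        self.capacity = working_capacity

    def observe(self, note: str) -> None:
        self.working.append(note)
        if len(self.working) > self.capacity:
            self._summarize_oldest()

    def _summarize_oldest(self) -> None:
        # Adaptive-summarization stand-in: compress the oldest half of working
        # memory into a single entry. A real system would summarize with an LLM.
        half = self.capacity // 2
        old, self.working = self.working[:half], self.working[half:]
        self.long_term.append("SUMMARY: " + "; ".join(o[:40] for o in old))

    def recall(self, query: str, k: int = 3) -> list:
        # Selective-recall stand-in: naive keyword match over long-term memory.
        hits = [m for m in self.long_term if query.lower() in m.lower()]
        return hits[:k]
```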

Because these mechanisms are not tied to a specific base model, we tested CORPGEN across three different agents. In each case, we observed consistent gains. The improvements came from the architecture, not from the strength of any particular model. Figure 2 shows how they fit together within CORPGEN’s architecture.

Architecture diagram of the CORPGEN framework. At center is the Digital Employee with persistent identity, execution engine, cognitive tools, sub-agents, and context management. On the left, Hierarchical Planning decomposes strategic objectives into tactical plans and operational actions. On the right, Sub-Agents as Tools shows a Research Agent and Computer-Use agent (UFO2) operating in isolated contexts. At the bottom, the Tiered Memory Architecture spans working memory, structured long-term memory, and semantic memory via Mem0. Experiential Learning in the bottom right captures successful trajectories and routes feedback to UFO2. Multi-Employee Collaboration at the top shows async communication via Email and Teams with no shared state.
Figure 2. Four mechanisms support concurrent task execution in CORPGEN: hierarchical planning, isolated subagents, tiered memory, and adaptive summarization.

How digital employees collaborate

When multiple digital employees operate in the same environment, collaboration takes shape through standard communication channels, without predefined coordination rules. One employee sends an email requesting data; another picks it up in the next cycle, uses its memory to process it, and responds. This exchange mirrors real workplace communication.

There is no shared internal state between agents. Coordination occurs entirely through email and Microsoft Teams, the same channels many workers use. Over time, these independent exchanges form recognizable organizational patterns. Some agents take on leadership roles; others provide support; shared documents become the connective tissue.

When a communication path breaks, such as an email delivery error, agents reroute messages through alternate channels to keep work moving. The result is a virtual organization that behaves like a real one without being explicitly programmed to do so.

Evaluating CORPGEN

We evaluated CORPGEN on a multi-task benchmark that combined up to 46 tasks into a single six-hour session. Three findings stood out.

Baselines degrade as load increases; CORPGEN does not. All three baseline agent systems showed steady performance declines as task load rose. CORPGEN, by contrast, maintained or improved its completion rates at higher loads. At 46 tasks, CORPGEN completed 15.2% of tasks, compared with 4.3% for the baselines, roughly 3.5 times more.

Experiential learning drives the largest gains. We introduced CORPGEN’s components sequentially: first the orchestration layer, then cognitive tools, and finally experiential learning. The first two produced moderate improvements. Experiential learning, in which agents store records of completed tasks and reuse them when they encounter structurally similar work, produced the largest increase, raising completion rates from 8.7% to 15.2%.

Evaluation methodology changes the picture. When we inspected the actual output files produced by agents, the results agreed with human judgements roughly 90% of the time. Evaluation based on screenshots and action logs agreed only about 40% of the time. This gap suggests that common evaluation approaches may underestimate what agents actually accomplish in practice.
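
To make the experiential-learning mechanism from the second finding concrete, the sketch below stores completed-task trajectories and retrieves the most structurally similar one for a new task, using a simple string-similarity stand-in. The class and the similarity measure are illustrative, not CORPGEN's implementation.

```python
from difflib import SequenceMatcher

class ExperienceStore:
    """Illustrative experiential-learning store: keep trajectories of completed
    tasks and surface the most structurally similar one for a new task."""

    def __init__(self):
        self._store = []  # list of (task signature, action trajectory) pairs

    def record(self, task_signature: str, trajectory: list) -> None:
        self._store.append((task_signature, trajectory))

    def most_similar(self, new_task: str):
        if not self._store:
            return None
        _, trajectory = max(self._store,
                            key=lambda item: SequenceMatcher(None, item[0], new_task).ratio())
        return trajectory

store = ExperienceStore()
store.record("summarize Q3 sales report into slide deck",
             ["open report.xlsx", "extract totals", "create deck", "paste chart"])
print(store.most_similar("summarize Q4 sales report into slide deck"))
```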

Implications and looking forward

The results suggest that memory and retrieval, not just raw model capability, may be a key bottleneck in getting agents to work in the real world. The largest gains came from experiential learning. Agents that learn from prior successes and apply those patterns to structurally similar tasks build an advantage over systems that respond to each task in isolation.

CORPGEN also opens a new lens on how AI agents collaborate. Next steps include testing whether agents can maintain memory across multiple workdays and how they coordinate when working in teams. We are also exploring ways to make agents faster and more reliable by combining different methods of interacting with software.


Acknowledgments

This work is a result of a collaboration between the Office of the CTO at Microsoft and the Microsoft AI Development Accelerator Program (MAIDAP). We would like to thank the Microsoft Security Research team for providing resources that supported this research. We also thank the members of the Microsoft UFO2 (opens in new tab) team and the Mem0 (opens in new tab) project for their open-source contributions, which enabled key components of the CORPGEN architecture, and the OSWorld team for the benchmark that served as the foundation for our multi-task evaluation.

Finally, we thank the many contributors to this research: Charlotte Siska, Manuel Raúl Meléndez Luján, Anthony Twum-Barimah, and Mauricio Velazco.

The post CORPGEN advances AI agents for real work appeared first on Microsoft Research.

]]>
Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions http://approjects.co.za/?big=en-us/research/blog/media-authenticity-methods-in-practice-capabilities-limitations-and-directions/ Thu, 19 Feb 2026 16:00:51 +0000 http://approjects.co.za/?big=en-us/research/?p=1162092 As synthetic media grows, verifying what’s real, and the origin of content, matters more than ever. Our latest report explores media integrity and authentication methods, their limits, and practical paths toward trustworthy provenance across images, audio, and video.

The post Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions appeared first on Microsoft Research.

]]>
Insights from Microsoft’s Media Integrity and Authentication: Status, Directions, and Futures report

three white outline icons on a blue-to-pink gradient background: an image with a copyright “CR” badge, an image overlaid with fingerprint-like lines, and an image framed by a cropping grid.

It has become increasingly difficult to distinguish fact from fiction when viewing online images and videos. Resilient, trustworthy technologies can help people determine whether the content they are viewing was captured by a camera or microphone—or generated or modified by AI tools. 

We refer to technologies aimed at helping viewers verify the source and history—that is, the provenance—of digital content as media integrity and authentication (MIA) methods. These techniques, driven by the Coalition for Content Provenance and Authenticity (opens in new tab) (C2PA), a standards body dedicated to scaling these capabilities, together with complementary methods such as watermarks and fingerprinting, have become critically important with the rapid advance of AI systems capable of creating realistic imagery, video, and audio at scale.

A convergence of forces

Our team recognized an inflection point in the evolution of online content integrity, driven by the convergence of four forces:

  • Growing saturation of synthetic media, driven by proliferation of high-fidelity content-generation tools and the explosion of AI generated or modified media online
  • Forthcoming legislation both nationally and internationally seeking to define what “verifiable” provenance should mean in practice
  • Mounting pressure on implementers to ensure authentication signals are clear and helpful, especially as signals increase when legislation goes into effect in 2026
  • Heightened awareness of potential adversarial attacks that attempt to exploit weaknesses in authenticity systems

The usefulness and trustworthiness of provenance signals, whether certifying content as synthetic or as an authentic capture of real-world scenes, will depend not only on advances in technology, but also on how the broader digital ecosystem adopts, implements, and governs these tools. Aligning around implementation choices that promote consistency and clarity is essential to ensure transparency signals strengthen, rather than erode, public confidence.

To address these challenges, we launched a comprehensive evaluation of the real-world limits, edge cases, and emerging “attack surfaces” for MIA methods. Today, we are publishing our findings in the Media Integrity & Authentication: Status, Directions & Futures report. The report distills lessons learned and outlines practical directions for strengthening media integrity in the years ahead.

Findings and directions forward

Our research recognizes that different media integrity and authenticity methods serve differing purposes and offer distinct levels of protection. After defining each method in detail, we focused on secure provenance (C2PA), imperceptible watermarking, and soft hash fingerprinting across images, audio, and video.

Grounded in our evaluation of these MIA methods across modalities, attack categories, and real-world workflows, several new findings emerged, including two new concepts:

  • High-Confidence Provenance Authentication: a critical capability for verifying, under defined conditions, whether claims about the origin of and modifications made to an asset can be validated with high certainty (a conceptual sketch of this validation step follows this list).
  • Sociotechnical Provenance Attacks: attacks aimed at deception and capable of inverting signals, making authentic content appear synthetic, and synthetic content appear authentic.
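
The sketch below is a purely conceptual illustration of that validation step: check that the asset still matches the hash a provenance claim asserts, then check that the claim was signed by a trusted key. It is not the C2PA specification or any Microsoft tooling, and real validation involves full manifests, certificate chains, and trust lists.

```python
import hashlib
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def validate_provenance(asset_bytes: bytes, claimed_hash_hex: str,
                        signature: bytes, signer_public_key: Ed25519PublicKey) -> bool:
    """Conceptual high-confidence check: (1) the asset still matches the hash the
    provenance claim asserts, and (2) the claim was signed by a trusted key.
    Real C2PA validation involves full manifests, certificate chains, and trust lists."""
    digest = hashlib.sha256(asset_bytes).hexdigest()
    if digest != claimed_hash_hex:
        return False  # asset was altered after the claim was made
    try:
        # Simplification: treat the raw hash as the signed payload.
        signer_public_key.verify(signature, bytes.fromhex(claimed_hash_hex))
        return True
    except InvalidSignature:
        return False  # claim was not produced by the expected signer
```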

Drawing on our findings, we identified four promising directions for further strengthening media authentication, along with suggestions to support more effective implementation strategies and future decisions. We’ve summarized the findings and directions below, with additional detail available in the report.

Promising direction: Delivering high-confidence provenance authentication
High-level findings:
– Implementation and display choices may affect the reliability of provenance indicators and how they are interpreted by the public.
– Using a C2PA provenance manifest for media created and signed in a high-security environment enables high-confidence validation.
– High-confidence validation is also possible across a broader volume of images, audio, and video when an imperceptible watermark is linked to a C2PA provenance manifest as an additional layer to recover the provenance information if removed.
– Fingerprinting is not an enabler for high-confidence validation and can involve significant costs when expected at scale. However, it can support manual forensics.

Promising direction: Mitigating confusion from sociotechnical provenance attacks
High-level findings:
– MIA methods are susceptible to sociotechnical attacks on provenance that may mislead the public, resulting in confusion and misplaced trust about an asset’s provenance if there is an overreliance on low-quality signals.
– Layering and linking secure provenance and imperceptible watermarking methods to achieve high-confidence validation also offers a promising option to both deter and mitigate the impact of attacks.
– Unintended consequences may result from the use of methods lacking authentication, such as the use of perceptible watermarks in the absence of secure provenance. Perceptible watermarks may cause confusion in cases of forgery or discourage people from consulting high-confidence provenance information via a validation tool, if such perceptible disclosures are taken at face value.
– UX design that enables users to explore manifest details—such as where edits occurred or region of interest—has the potential to reduce confusion and support forensics and fact checking efforts.

Promising direction: Enabling more trusted provenance on edge devices
High-level findings:
– High-confidence results aren’t feasible when provenance is added by a conventional offline device (e.g., camera or recording device without connectivity).
– Implementing a secure enclave within the hardware layer of offline devices is essential to make the provenance of captured images, audio, and video more trustworthy.

Promising direction: Investing in ongoing research and policy development
High-level findings:
– All three methods offer organizations valuable tools for addressing operational challenges such as fraud prevention, risk management, and digital accountability.
– UX and display are promising directions for research. Important directions include in-stream tools that display provenance information where people are and distinguish between high- and lower-confidence provenance signals.
– Stakeholders should conduct ongoing analysis and red teaming to identify and mitigate weaknesses through technical approaches, policies, and laws.

The journey continues

This report marks the beginning of a new chapter in our media provenance journey (opens in new tab), building on years of foundational work, from developing the very first prototype in 2019 to co-founding the C2PA in 2021 and helping catalyze an ecosystem that has since grown to more than 6,000 members and affiliates (opens in new tab) supporting C2PA Content Credentials. This research represents the next evolution of that long‑standing commitment.

We hope that sharing our learnings will help others prepare for an important wave, especially as generative technologies accelerate and provenance signals multiply. This work is already underway across our products at Microsoft. Together, these directions highlight opportunities for the ecosystem to align, harden, and innovate, so authentication signals are not merely visible, but robust, meaningful, and resilient throughout the content lifecycle.

The post Media Authenticity Methods in Practice: Capabilities, Limitations, and Directions appeared first on Microsoft Research.

]]>