Abstracts Archives - Microsoft Research
http://approjects.co.za/?big=en-us/research/podcast-series/abstracts/

Abstracts: August 15, 2024
http://approjects.co.za/?big=en-us/research/podcast/abstracts-august-15-2024/
Fri, 16 Aug 2024 00:29:59 +0000

Advanced AI may make it easier for bad actors to deceive others online. A multidisciplinary research team is exploring one solution: a credential that allows people to show they’re not bots without sharing identifying information. Shrey Jain and Zoë Hitzig explain.

The post Abstracts: August 15, 2024 appeared first on Microsoft Research.

Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Microsoft Product Manager Shrey Jain and OpenAI Research Scientist Zoë Hitzig join host Amber Tingle to discuss “Personhood credentials: Artificial intelligence and the value of privacy-preserving tools to distinguish who is real online.” In their paper, Jain, Hitzig, and their coauthors describe how malicious actors can draw on increasingly advanced AI tools to carry out online deception that is harder to detect and more harmful. Bringing ideas from cryptography into AI policy conversations, they identify a possible mitigation: a credential that allows its holder to prove they’re a person, not a bot, without sharing any identifying information. This exploratory research reflects a broad range of collaborators from across industry, academia, and the civil sector specializing in areas such as security, digital identity, advocacy, and policy.

Transcript

[MUSIC]

AMBER TINGLE: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research—in brief. I’m Amber Tingle. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Our guests today are Shrey Jain and Zoë Hitzig. Shrey is a product manager at Microsoft, and Zoë is a research scientist at OpenAI. They are two of the corresponding authors on a new paper, “Personhood credentials: Artificial intelligence and the value of privacy-preserving tools to distinguish who is real online.” This exploratory research comprises multidisciplinary collaborators from across industry, academia, and the civil sector. The paper is available now on arXiv. Shrey and Zoë, thank you so much for joining us, and welcome back to the Microsoft Research Podcast.

SHREY JAIN: Thank you. We’re happy to be back.

ZOË HITZIG: Thanks so much.

TINGLE: Shrey, let’s start with a brief overview of your paper. Why is this research important, and why do you think this is something we should all know about?

JAIN: Malicious actors have been exploiting anonymity as a way to deceive others online. And historically, deception has been viewed as an unfortunate but necessary cost of preserving the internet’s commitment to privacy and unrestricted access to information. Today, AI is changing the way we should think about malicious actors’ ability to succeed in those attacks. It makes it easier to create content that is indistinguishable from human-created content, and to do so in a way that is only getting cheaper and more accessible. So this paper proposes a countermeasure that protects against AI-powered deception at scale while also protecting privacy. And I think there are two reasons people should care about this problem. One is that it can very soon become logistically exhausting to deal with these various types of scams. I think we’ve all been susceptible to different attacks or scams, but now these scams are going to become much more persuasive and effective, and it can become very challenging to recover access to your accounts or to rebuild a reputation that someone has damaged online. But more importantly, there are also very dangerous possibilities. Kids might not be safe online anymore. And a lot of the way we shape political views today happens online, so our ability to communicate online for democratic processes is also at risk. In response, we propose in this paper a solution titled personhood credentials. Personhood credentials enable people to prove that they are in fact a real person without revealing anything more about themselves online.

TINGLE: Zoë, walk us through what’s already been done in this field, and what’s your unique contribution to the literature here?

HITZIG: I see us as intervening on two separate bodies of work. And part of what we’re doing in this paper is bringing together those two bodies of work. There’s been absolutely amazing work for decades in cryptography and in security. And what cryptographers have been able to do is to figure out protocols that allow people to prove very specific claims about themselves without revealing their full identity. So when you think about walking into a bar and the bartender asks you to prove that you’re over 21—or over 18, depending on where you are—you typically have to show your full driver’s license. And now that’s revealing a lot of information. It says, you know, where you live, whether you’re an organ donor. It’s revealing a lot of information to that bartender. And online, we don’t know what different service providers are storing about us. So, you know, the bartender might not really care where we live or whether we’re an organ donor. But when we’re signing up for digital services and we have to show a highly revealing credential like a driver’s license just to get access to something, we’re giving over too much information in some sense. And so this one body of literature that we’re really drawing on is a literature in cryptography. The idea that I was talking about there, where you can prove privately just isolated claims about yourself, that’s an idea called an anonymous credential. It allows you to be anonymous with respect to some kind of service provider while still proving a limited claim about yourself, like “I am over 18,” or in the case of personhood credentials, you prove, “I am a person.” So that’s all one body of literature. Then there’s this huge other body of literature and set of conversations happening in policy circles right now around what to do about AI. Huge questions abounding. 
Shrey and I have written a prior paper called “Contextual Confidence and Generative AI,” which we talked about on this podcast, as well, and in that paper, we offered a framework for thinking about the specific ways that generative AI, sort of, threatens the foundations of our modes of communication online. And we outlined about 16 different solutions that could help us to solve the coming problems that generative AI might bring to our online ecosystems. And what we decided to do in this paper was focus on a set of solutions that we thought are not getting enough attention in those AI and AI policy circles. And so part of what this paper is doing is bringing together these ideas from this long body of work in cryptography into those conversations.
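The “prove one isolated claim without revealing the rest” idea behind anonymous credentials can be illustrated with a toy commit-and-reveal sketch. This is a simplified stand-in, not a real anonymous-credential scheme (real schemes also provide unlinkability and use proper digital signatures rather than the shared-key HMAC used here for brevity):

```python
import hashlib
import hmac
import secrets

ISSUER_KEY = secrets.token_bytes(32)  # stand-in for an issuer's signing key

def commit(value: str):
    # hide an attribute value behind a salted hash commitment
    nonce = secrets.token_bytes(16)
    return hashlib.sha256(nonce + value.encode()).digest(), nonce

def issue(attributes: dict):
    # issuer commits to every attribute and signs the full set of commitments
    commitments, openings = {}, {}
    for name, value in attributes.items():
        commitments[name], openings[name] = commit(value)
    blob = b"".join(commitments[k] for k in sorted(commitments))
    signature = hmac.new(ISSUER_KEY, blob, hashlib.sha256).digest()
    return commitments, openings, signature

def present(commitments, openings, signature, reveal: str, claimed: str) -> bool:
    # verifier checks the issuer's signature, then only the one opened claim
    blob = b"".join(commitments[k] for k in sorted(commitments))
    if not hmac.compare_digest(
            hmac.new(ISSUER_KEY, blob, hashlib.sha256).digest(), signature):
        return False  # credential was not issued by this issuer
    check = hashlib.sha256(openings[reveal] + claimed.encode()).digest()
    return hmac.compare_digest(check, commitments[reveal])

creds, opens_, sig = issue({"name": "Alice", "over_18": "yes"})
assert present(creds, opens_, sig, "over_18", "yes")     # one claim verifies
assert not present(creds, opens_, sig, "over_18", "no")  # a false claim fails
```

The holder reveals only the `over_18` opening; the `name` commitment stays hidden, which is the selective-disclosure property the discussion above describes.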

TINGLE: I’d like to know more about your methodology, Shrey. How did your team go about conducting this research?

JAIN: So we had a wide range of collaborators from industry, academia, and the civil sector who work on digital identity, privacy, advocacy, security, and AI policy, and who came together to think about the clearest way to explain what we believe is a countermeasure against AI-powered deception. From a technological point of view, there’s already a large body of work we can reference; the question is how it can be implemented, and whether we can clearly discuss the tradeoffs that academics and industry leaders are weighing. So the methodology here was really about bringing together a wide range of collaborators to bridge these two bodies of work and communicate them clearly: not just the technical solutions but also the tradeoffs.

TINGLE: So, Zoë, what are the major findings here, and how are they presented in the paper?

HITZIG: I am an economist by training. Economists love to talk about tradeoffs. You know, when you have some of this, it means you have a little bit less of that. It’s kind of like the whole business of economics. And a key finding of the paper, as I see it, is that we begin with what feels like a tradeoff, which is on the one hand, as Shrey was saying, we want to be able to be anonymous online because that has great benefits. It means we can speak truth to power. It means we can protect civil liberties and invite everyone into online spaces. You know, privacy is a core feature of the internet. And at the same time, the, kind of, other side of the tradeoff that we’re often presented is, well, if you want all that privacy and anonymity, it means that you can’t have accountability. There’s no way of tracking down the bad actors and making sure that they don’t do something bad again. And we’re presented with this tradeoff between anonymity on the one hand and accountability on the other hand. All that is to say, a key finding of this paper, as I see it, is that personhood credentials and more generally this class of anonymous credentials that allow you to prove different pieces of your identity online without revealing your entire identity actually allow you to evade the tradeoff and allow you to, in some sense, have your cake and eat it, too. What it allows us to do is to create some accountability, to put back some way of tracing people’s digital activities to an accountable entity. What we also present in the paper are a number of different, sort of, key challenges that will have to be taken into account in building any kind of system like this. But we present all of that, all of those challenges going forward, as potentially very worth grappling with because of the potential for this, sort of, idea to allow us to preserve the internet’s commitment to privacy, free speech, and anonymity while also creating accountability for harm.

TINGLE: So Zoë mentioned some of these tradeoffs. Let’s talk a little bit more about real-world impact, Shrey. Who benefits most from this work?

JAIN: I think there are many different people who benefit. One is anyone communicating or doing anything online, in that they can have more confidence in their interactions. That builds on the paper Zoë and I wrote last year on contextual confidence and generative AI: we want to have confidence in our interactions, and one component of that is being able to identify who you’re speaking with while doing so in a privacy-preserving way. Another group that benefits is policymakers. This work complements a lot of the existing work being done on provenance and watermarking, and it can help policymakers be more effective in their mission of creating a safer online space, in that it highlights a technology that is not currently discussed as much as those other solutions and that complements them in protecting online communication.

HITZIG: You know, social media is flooded with bots, and sometimes the problem with bots is that they’re posting fake content, but other times, the problem with bots is that there are just so many of them and they’re all retweeting each other and it’s very hard to tell what’s real. And so what a personhood credential can do is say, you know, maybe each person is only allowed to have five accounts on a particular social media platform.
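The per-person account cap Hitzig mentions can be sketched in a few lines: the platform sees only an opaque personhood-credential identifier, never the person's identity, and limits accounts per identifier. The identifiers and the cap of five are illustrative:

```python
from collections import defaultdict

MAX_ACCOUNTS = 5
accounts_per_credential: dict = defaultdict(int)

def create_account(credential_id: str) -> bool:
    # the platform learns only that some verified person asked for an account
    if accounts_per_credential[credential_id] >= MAX_ACCOUNTS:
        return False  # this person already holds the maximum allowed
    accounts_per_credential[credential_id] += 1
    return True

assert all(create_account("cred-abc") for _ in range(5))  # first five succeed
assert not create_account("cred-abc")                     # sixth is refused
assert create_account("cred-xyz")                         # others unaffected
```

A bot farm holding one credential is capped at five accounts, while legitimate users are untouched — which is the anti-amplification effect described above.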

TINGLE: So, Shrey, what’s next on your research agenda? Are there lingering questions—I know there are—and key challenges here, and if so, how do you hope to answer them?

JAIN: We believe we’ve aggregated a strong set of industry, academic, and civil sector collaborators, but we’re only a small subset of the people who are going to be interacting with these systems. So the first area of next steps is to gather feedback on the solution we’ve proposed and how we can improve it: are there tradeoffs we’re missing? Are there technical components we weren’t thinking through as deeply? And there are a lot of narrow open questions that come out of this. For instance, how do personhood credentials relate to existing laws regarding identity theft and identity protection? In areas where service providers can’t require government IDs, how does that apply to personhood credentials that rely on government IDs? There are a lot of open questions that we address in the paper that need more experimentation and thinking through, but also a lot of empirical work to be done. How do people react to personhood credentials, and do they actually enhance confidence in interactions online? I think there are a lot of open questions on the actual effectiveness of these tools, so there’s a large area of work to be done there, as well.

HITZIG: I’ve been thinking a lot about the early days of the internet. I wasn’t around for that, but I know that every little decision that was made in a very short period of time had incredibly lasting consequences that we’re still dealing with now. There’s an enormous path dependence in every kind of technology. And I feel that right now, we’re in that period of time, the small window where generative AI is this new thing to contend with, and it’s uprooting many of our assumptions about how our systems can work or should work. And I’m trying to think about how to set up those institutions, make these tiny decisions right so that in the future we have a digital architecture that’s really serving the goals that we want it to serve.

[MUSIC]

TINGLE: Very thoughtful. With that, Shrey Jain, Zoë Hitzig, thank you so much for joining us today.

HITZIG: Thank you so much, Amber.

TINGLE: And thanks to our listeners, as well. If you’d like to learn more about Shrey and Zoë’s work on personhood credentials and advanced AI, you’ll find a link to this paper at aka.ms/abstracts, or you can read it on arXiv. Thanks again for tuning in. I’m Amber Tingle, and we hope you’ll join us next time on Abstracts.

[MUSIC FADES]

Abstracts: July 29, 2024
http://approjects.co.za/?big=en-us/research/podcast/abstracts-july-29-2024/
Mon, 29 Jul 2024 16:18:20 +0000

A lack of appropriate data, decreased model performance, and other obstacles have made it difficult to expand the input language models can receive. Li Lyna Zhang introduces LongRoPE, a method capable of extending context windows to more than 2 million tokens.

The post Abstracts: July 29, 2024 appeared first on Microsoft Research.


In this episode, Senior Researcher Li Lyna Zhang joins host Gretchen Huizinga to discuss “LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens,” which was accepted at this year’s International Conference on Machine Learning (ICML). LongRoPE, a method for increasing the input capabilities of language models, can expand context windows to 2-million-plus tokens while maintaining model performance—no major adjustments to the original model architecture needed. LongRoPE has been integrated into Phi-3, a family of small language models developed by Microsoft and available on Microsoft Azure.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

My guest today is Dr. Li Lyna Zhang, a senior researcher at Microsoft Research. Dr. Zhang is coauthor of a paper called “LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens.” This paper was featured at this year’s International Conference on Machine Learning, or ICML. Li, thanks so much for joining us today on Abstracts!

LI LYNA ZHANG: Thank you for having me.

HUIZINGA: So let’s start with a brief overview of your paper. Tell us about the issue your research addresses and why it matters.

ZHANG: OK, so this paper is about how to effectively extend the context window of large language models beyond 2 million tokens. Why is this important? Because enabling longer input contexts can improve LLM capabilities. Right now, some LLMs can only handle a limited context window of 4K tokens, which is about 10 pages in a book. With our method, we can push the LLM context window to over 2 million tokens. That means you can put all seven Harry Potter books into the LLM and ask any question about the story! Another important thing is that our method is super efficient. It requires minimal changes to the LLM architecture, and most existing optimizations can be reused. Therefore, our method can be easily applied in real production.

HUIZINGA: So it sounds like what you’re working on is improving the memory span of artificial intelligence or large language models. So what’s already been done in this field, and what unique contributions does your work bring?

ZHANG: Well, there has been a lot of work in building long-context LLMs. For example, pretraining with an efficient model architecture, using RAG (retrieval-augmented generation), and extending the context window with RoPE positional interpolation. Our approach uses the last technique. Let me briefly explain it. RoPE stands for rotary positional embedding, which encodes token position information for transformer models. When we pretrain an LLM, we set a context window size, and all token positions have a predefined range of RoPE values. Extending to a longer context window introduces new token positions that can fall outside this predefined range, leading to out-of-distribution issues and making fine-tuning difficult. RoPE positional interpolation solves this by downscaling positional embeddings to fit within the pretrained range. However, positional embeddings like RoPE exhibit non-uniform information entropy in transformer models. Existing approaches do not effectively handle these non-uniformities during RoPE interpolation, leading to information loss and limiting the context window size. Our method addresses this challenge, and therefore it can achieve the longest context window size.
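The baseline technique Zhang describes — uniform RoPE positional interpolation — can be sketched in a few lines. The head dimension and RoPE base below are small illustrative values, not the paper's settings:

```python
DIM, BASE = 8, 10000.0  # illustrative head dimension and RoPE base

def rope_angles(position):
    # one rotation angle per 2-D dimension pair, as in rotary embeddings
    return [position * BASE ** (-2 * i / DIM) for i in range(DIM // 2)]

def interpolated_angles(position, pretrained_len, target_len):
    # uniform positional interpolation: downscale positions so the extended
    # range [0, target_len) maps back into the pretrained [0, pretrained_len)
    return rope_angles(position * pretrained_len / target_len)

# position 6000 is out of distribution for a 4096-token pretrained window,
# but interpolation maps it onto the in-range position 3000
assert interpolated_angles(6000, 4096, 8192) == rope_angles(3000)
```

Every extended position gets the encoding of a position the model already saw during pretraining — which avoids the out-of-distribution problem but, applied uniformly across dimensions, causes the information loss Zhang notes.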

HUIZINGA: OK, so, Li, how would you describe the methodology you used for this work, and how did you go about conducting the research?

ZHANG: OK. So our method interpolates the RoPE positional embedding, and it has three main steps. First, we introduce an efficient evolution search algorithm to perform non-uniform RoPE positional interpolation. Second, we propose a progressive context window extension strategy. It begins by searching for a 256K length on the pretrained LLM and fine-tuning it at this length. Then, based on the fine-tuned 256K LLM, we perform a second search for new RoPE interpolations to achieve a 2048K context window size. Finally, since long-context LLMs drop performance at their original context window, we readjust the non-uniform positional interpolation at a 4K length to recover the short-context-window performance.
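The progressive schedule Zhang outlines can be sketched as pure control flow. The functions here are hypothetical stubs that just record each stage; they stand in for the paper's actual search and training steps:

```python
def evolution_search(model, target_len):
    return target_len                      # stub: found interpolation factors

def apply_interpolation(model, factors):
    return model + [f"interp@{factors}"]   # stub: rescale RoPE per dimension

def fine_tune(model, length):
    return model + [f"finetune@{length}"]  # stub: fine-tune at this length

model = ["pretrained-4k"]
# step 2a: search for a 256K extension, then fine-tune at that length
model = apply_interpolation(model, evolution_search(model, 256_000))
model = fine_tune(model, 256_000)
# step 2b: second search on the fine-tuned model to reach 2048K (no fine-tune)
model = apply_interpolation(model, evolution_search(model, 2_048_000))
# step 3: readjust interpolation at 4K to recover short-context performance
model = apply_interpolation(model, evolution_search(model, 4_096))
assert model[-1] == "interp@4096"
```

The point of the sketch is the ordering: search, fine-tune, search again, then a final short-length readjustment, exactly as described above.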

HUIZINGA: Let’s talk about findings. Tell us how things worked out for you and what you found as a result of your experiments.

ZHANG: Yeah. Our study verified two important non-uniformities in LLM context window extension. We identified that lower RoPE dimensions and initial token positions require less interpolation because they contain crucial, high-frequency information. Higher RoPE dimensions require more interpolation because they carry sparse, low-frequency information.
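These two non-uniformities can be illustrated with a toy version of per-dimension rescaling. The scale factors and the "keep first" threshold below are illustrative stand-ins for the values LongRoPE's evolution search would actually find:

```python
DIM, BASE = 8, 10000.0  # illustrative head dimension and RoPE base

def nonuniform_angles(position, per_dim_scale, keep_first):
    angles = []
    for i in range(DIM // 2):
        freq = BASE ** (-2 * i / DIM)
        # finding 2: the first few token positions keep their original encoding
        p = position if position < keep_first else position / per_dim_scale[i]
        angles.append(p * freq)
    return angles

# finding 1: lower dimensions (high-frequency) get scale ~1, i.e. little
# interpolation; higher dimensions (low-frequency) get larger scales
scales = [1.0, 1.5, 2.0, 4.0]
early = nonuniform_angles(10, scales, keep_first=64)    # left untouched
late = nonuniform_angles(6000, scales, keep_first=64)   # rescaled per dim
assert early == [10 * BASE ** (-2 * i / DIM) for i in range(DIM // 2)]
assert late[0] == 6000.0  # dimension 0 is barely interpolated (scale 1.0)
```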

HUIZINGA: So work in the lab is always interesting, but deployment in real-world settings is often another story. If everything is successful, Li, who benefits most from your LongRoPE research?

ZHANG: Well, our work significantly improves LLMs’ capabilities to handle long context in real-world applications, such as long-context retrieval, code debugging, and even multimodal LLM applications. Moreover, our method achieves this with minimal modifications to the RoPE positional embedding, so it can be widely applied in production. We have integrated LongRoPE into the Microsoft Phi-3 128K family, which are the first long-context LLMs in their class. Before LongRoPE, Phi models had only a 2K context window.

HUIZINGA: So who is your primary user?

ZHANG: I think any users who want to use long-context LLMs can be our audience.

HUIZINGA: So it’s a wide audience.

ZHANG: Yeah, it’s a wide audience.

HUIZINGA: It’s about now that I always ask the “golden nugget” question. If you wanted to leave our listeners with one key takeaway from this research, what would it be?

ZHANG: Well, if there’s one key takeaway from our work, it must be our key finding that non-uniformities in rotary positional embeddings are crucial for LLM context window extension. And if you want to build a high-quality long-context LLM, LongRoPE is all you need to know!

HUIZINGA: Talk about what’s left to do in this field in terms of open questions and outstanding challenges. What’s next on your research agenda, Li?

ZHANG: So far, there are still a couple of big questions in this field. First, it’s challenging to achieve both strong long-context and short-context capabilities at the same time. Although we have managed to recover some of the short-context performance for long-context LLMs, it has not recovered 100 percent. We are trying different approaches to close these gaps. Second, we want to figure out how we can use these long-context LLMs to solve more challenging tasks, and then we can push these models to work harder and smarter for us.

[MUSIC]

HUIZINGA: Well, Li Lyna Zhang, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find it on arXiv. See you next time on Abstracts!

[MUSIC FADES]

Abstracts: July 18, 2024
http://approjects.co.za/?big=en-us/research/podcast/abstracts-july-18-2024/
Thu, 18 Jul 2024 13:00:00 +0000

Senior Researcher Arindam Mitra introduces AgentInstruct. Using raw data sources, the automated multi-agent framework can create diverse, high-quality synthetic data at scale for the post-training of small and large language models.

The post Abstracts: July 18, 2024 appeared first on Microsoft Research.


In this episode, Senior Researcher Arindam Mitra joins host Gretchen Huizinga to discuss “AgentInstruct: Toward Generative Teaching with Agentic Flows.” In their paper, Mitra and his coauthors introduce an automated multi-agent framework for creating diverse, high-quality synthetic data at scale for language model post-training. In contrast to methods that create data from a seed set of existing prompts and responses, AgentInstruct uses raw data and specifications provided by model builders. The work—which post-trains a model, Orca-3, on AgentInstruct-generated data—is part of project Orca. Orca aims to develop techniques for creating small language models that can perform as well as large language models. Like Orca-3, the earlier Orca, Orca-2, and Orca-Math models show the effectiveness of leveraging synthetic data in training. 

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

I’m here today with Dr. Arindam Mitra, a senior researcher at Microsoft Research and the lead researcher for Microsoft’s Orca project. Dr. Mitra is coauthor of a paper called “AgentInstruct: Toward Generative Teaching with Agentic Flows.” Arindam, it’s a pleasure to have you on Abstracts today.

ARINDAM MITRA: Thank you, Gretchen.

HUIZINGA: So let’s start with a brief overview of your paper. What problem does your research address, and why does it matter?

MITRA: So the post-training phase is very important for language models. You can really improve a model a lot by creating high-quality synthetic data. The problem, though, is that high-quality synthetic data creation requires lots of human effort and expertise. The problem we’re trying to tackle is, how do you reduce that human effort? How can you create high-quality data with a really low amount of human effort? When you have a language model and, let’s say, you want to apply it somewhere, you might first have to train a generic model, which could be small or big; it doesn’t matter. After that, you can specialize it on the domain you are looking for, and when you want to make this particular process really fast, it’s best if you go for synthetic data. If you have a way to generate very high-quality synthetic data, you can fast-track this specialization process. And not only for a single model. This year, you’re going to see a lot more multi-agent models, and when you are trying to build these multi-agent models, you worry that they might increase the cost too much, or the latency too much. So it’s also very important that, in a multi-agent system, you can replace some of those agents with specialized small models. And when you’re trying to address these goals, you want this process to be something that you know works fast. That’s why we are trying to make sure we have a very good way to create synthetic data for your specific need.

HUIZINGA: No research exists in a vacuum, and most of it fills some kind of a gap. So tell us what’s already been done in this field and how this work is building on it.

MITRA: So previously, we have seen that in post-training, the more data you have, the better the performance of the model you’re training. So what we wanted to test is how much we can scale and what happens if we scale a lot. But we didn’t have the tools for it. The approach people previously used was to take a small set of data and expand it into a much larger amount of data. That’s where people were mostly focusing. But it’s not that easy to create that initial seed set. [LAUGHTER] You need to be a real expert. The way we’re doing it, rather, is that you define what you want to create. Like, OK, you want to create tool-use data. So you say, OK, I have a bunch of tools, and I am looking for data in scenarios where someone can just come give me a description and then maybe that person interacts with the AI to figure out how to get the job done. It’s not a one-step thing. And maybe you also have a setting where it’s more like an app developer: you have a bunch of APIs in your phone, and you just want to figure out which one is best for the user request, which came through a voice command. So different scenarios could be there. What we’re saying is, OK, we are not going through the method where you have to come up with your own initial seed data that we then expand. It’s more like you define what you want to do; it’s much more abstract. And then we are, sort of, automating the effort of data creation. So this setting of synthetic data creation is what we refer to as generative teaching, and that’s where we differ. Previously, it was more like expansion, and now we are going from specification to the data that you need.

HUIZINGA: Gotcha. Well talk a little bit more about your methodology and how you went about conducting this research.

MITRA: So first of all, what we are proposing is a multi-agent solution. You start by describing what you really need. So you describe in detail, like, I need data for this specific skill or this specific scenario. Then, what we do is, OK, you have some unstructured or raw data, like text documents or code files, that you gather from the web with a permissible license, or you use something that you own. We don’t care much about what the content really is. It’s more like we got some random stuff, some random content. And then we’ll guide you on how to convert this random something, which is not meaningful for you, into something that is meaningful for your data creation. For example, if you are creating data to teach how to use APIs, you might think, I need lots of APIs, and how do I get these APIs? What we are saying is, we can take something like code, and we’ll have agents that convert these raw code files into a list of APIs, which is more like a library. So you automatically create this input that is very meaningful for data creation. And then once we have that, we have the seed instruction creation step, based on your specification: what do you want to create data for? So you have all these different scenarios, and we have multiple agents creating data for different scenarios. And then the last step is what we call the refinement step. Whatever data you created, we’ll go through it and make it better and better: improve the quality, improve the complexity, improve the trickiness, teach when not to answer, etc. So we make sure we cover the whole space. By changing the stochastic seed, we are trying to cover the entire possible data space.
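The three-stage flow Mitra describes (content transformation, seed instruction creation, refinement) might be sketched like this. The agent functions, scenario names, and file name are hypothetical placeholders, not the paper's actual agents:

```python
def transform_content(raw_doc: str) -> str:
    # content-transformation agent: e.g., turn a raw code file into an API
    # list that is meaningful for data creation (stubbed as a label here)
    return f"APIs extracted from {raw_doc}"

def create_seed_instructions(transformed: str, scenarios: list) -> list:
    # seed-instruction agents: one instruction per target scenario
    return [f"[{s}] task based on {transformed}" for s in scenarios]

def refine(instruction: str) -> str:
    # refinement agents: raise complexity, add trickiness, add cases where
    # the right behavior is to decline to answer, etc.
    return instruction + " (refined: harder variant; unanswerable variant)"

raw = "payments.py"  # hypothetical raw code file
seeds = create_seed_instructions(
    transform_content(raw), ["voice command", "chat assistant"]
)
dataset = [refine(s) for s in seeds]
assert len(dataset) == 2
assert dataset[0].startswith("[voice command]")
```

Scaling this pattern across many raw documents, scenarios, and stochastic seeds is what lets the pipeline cover a whole skill's data space rather than expanding a hand-written seed set.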

HUIZINGA: Right.

MITRA: So that’s the key thing. The way we conducted this research is that we defined 17 skills. Skills meaning reading comprehension, tool use, text modification, content creation, RAG (retrieval-augmented generation), conversation, and so on; we have a list of 17 skills. Then we created one multi-agent flow for each of the skills, and we generated data. One key thing I want to highlight is that this work, compared to other work, was not benchmark driven. We want to teach a skill; we don’t care which benchmarks we’re trying to evaluate it on. So we define the skill: tool use means this to us, reading comprehension means this to us, text modification means this to us. And then we generate the data to teach everything for that skill. And then what we did is we created 22 million instructions. Previously, in the Orca series, we had around 3 million instructions. So around 25 million is what we have at the end. And that’s what we used to train a Mistral model, as of now. And we’re going to measure how much we improve the Mistral model with this post-training.

HUIZINGA: Moving from methods to findings, I always look forward to the part of the research paper that finishes the sentence “and what we found was … ,” so give us a quick overview of your results. What did you find?

MITRA: Yes, so the results were actually very exciting for us. So Mistral 7B was our main, sort of, baseline because that’s where we’re trying to showcase how much improvement we are getting. On the other side, we have the frontier models—ChatGPT, GPT-4. We want to also measure how far we are from those frontier models, so that’s, sort of, our evaluation setup. So on average, we got about a 20 percent performance gain over Mistral, and we evaluated that across 14 benchmarks that test reasoning, content creation, instruction following, format following, etc. But what was more important to us was to do a skill-specific evaluation because we are trying to teach certain skills, and we had 17 skills, as mentioned earlier. So, for example, if you are focusing on reading comprehension as a skill, we took LSAT, SAT, DROP, and many other benchmarks; we created a collection of reading comprehension-based benchmarks. And there, we are observing around a 20 percent improvement over Mistral, and what that means is we’re actually achieving GPT-4–level performance. Similarly, if I’m focusing on the math skill, there are many datasets which test elementary math, high school math, college-level math. And we improved actually across all these different levels of math. So we see from 40 percent to 150 percent improvement on different benchmarks of math. So it was what we wanted to see. We’re not optimizing for a particular benchmark. We wanted to optimize the skill, and that’s what you’re observing. So you’re observing improvement in math across all these levels, from elementary to middle school to high school to college, everything. The same goes for RAG, as well. We’re observing on the RAG skill around a 92 percent improvement over Mistral. The format-following numbers are pretty interesting to us. So format following is very important for SLMs (small language models). You want to make these models practical. You want to make sure that they follow the format so you can parse the result. And we were able to take Mistral beyond Gemini Pro. So that was a very strong performance from the post-training that we did. For summarization, we were able to reduce the hallucination rate by 31 percent while achieving GPT-4–level quality. So overall, all these results were, sort of, highlighting that the methodology that we have, which we’re calling AgentInstruct, is very promising.
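The skill-specific evaluation Arindam describes—grouping benchmarks under the skill they test and measuring improvement over the baseline—can be sketched as below. All benchmark names and scores here are made up for illustration; only the shape of the computation follows the discussion.

```python
# Sketch of skill-level evaluation: benchmarks grouped by skill,
# average percent gain over the baseline computed per skill.

benchmarks = {
    # skill -> {benchmark: (baseline_score, post_trained_score)}
    "reading_comprehension": {"LSAT-style": (60.0, 72.0), "DROP-style": (55.0, 66.0)},
    "math": {"elementary": (40.0, 60.0)},
}

def skill_gains(benchmarks):
    """Average percent improvement over the baseline, per skill."""
    gains = {}
    for skill, results in benchmarks.items():
        pct = [(post - base) / base * 100 for base, post in results.values()]
        gains[skill] = sum(pct) / len(pct)
    return gains

print({s: round(g, 1) for s, g in skill_gains(benchmarks).items()})
# {'reading_comprehension': 20.0, 'math': 50.0}
```

Aggregating by skill rather than by benchmark is what lets the team claim improvement on, say, math as a whole rather than on any single test set.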

HUIZINGA: I think it’s important to get practical and talk about real-world impact. So tell us who you think this research will benefit most and why.

MITRA: Yeah, so again, the model builders will, sort of, find it most beneficial. So the significance of our work lies in the way we are trying to revolutionize language model development through scalable, low-effort synthetic data creation. And scalable and low effort is, sort of, the key thing, right. We have shown that we can create very high-quality data. That’s what the numbers are telling us. We want to mention that this is very scalable and low effort, and that’s what we think might help model builders the most.

HUIZINGA: So, Arindam, let’s borrow a phrase from the machine learning lexicon and go for a little one-shot learning here: if you had to boil down why your work is important, what’s the one thing you want our listeners to take away from this research?

MITRA: The key takeaway would be that the AgentInstruct method enables the generation of vast, diverse, and high-quality synthetic data with very minimal human input. So that’s the one thing I would like people to remember from this paper.

HUIZINGA: So as we close, talk briefly about the limitations that you encountered in this project and directions for future research. What are the outstanding challenges in this field, and what’s on your research agenda to overcome them?

MITRA: Yes, so we’re exploring further automation. Apart from making this data creation more automated and requiring less human involvement, we’re trying to focus on two other aspects. One is automated model debugging, and the other is automated model repairing. So now that we have the ability to generate data for a particular skill, let’s say math, for model debugging, what we need is basically an error handler: something we can plug in which takes the question and the answer coming from a different model and verifies if the answer is correct or not. So that’s the part we’re working on right now, figuring out this error handler. And the second aspect is repairing. So once we have the errors, we figure out, OK, this is where the model is struggling. How can we give feedback, or how can we give more knowledge, so it can correct those errors? So those are some things we’re working on right now.
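The "error handler" idea described above—a pluggable checker that takes a question and a model's answer and flags failures for the repair step—can be sketched as follows. The exact-match check is a trivial stand-in for a real verifier, and all names and examples are hypothetical.

```python
# Sketch of a pluggable error handler for automated model debugging.

def error_handler(question, model_answer, reference_answer):
    """Return None if the answer checks out, else a debug record."""
    if model_answer.strip() == reference_answer.strip():
        return None
    return {"question": question, "got": model_answer, "expected": reference_answer}

results = [
    error_handler("What is 2 + 2?", "4", "4"),
    error_handler("What is 3 * 5?", "16", "15"),
]
errors = [e for e in results if e is not None]
print(len(errors))  # one failing case to feed into the repair step
```

The debug records this produces are exactly the input the repair step would need: where the model is struggling and what the gap looks like.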

[MUSIC PLAYS]

HUIZINGA: Well, Arindam Mitra, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find a preprint on arXiv. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: July 18, 2024 appeared first on Microsoft Research.

]]>
Abstracts: May 20, 2024 http://approjects.co.za/?big=en-us/research/podcast/abstracts-may-20-2024/ Mon, 20 May 2024 20:15:09 +0000 Andrey Kolobov discusses WindSeer, a small CNN capable of estimating the wind field around an sUAV in flight more finely and with less compute and data than traditional models. The advancement can help support longer and safer autonomous flights.

The post Abstracts: May 20, 2024 appeared first on Microsoft Research.

]]>
Microsoft Research Podcast - Abstracts | May 20, 2024 | Andrey Kolobov

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Principal Research Manager Andrey Kolobov joins host Gretchen Huizinga to discuss “WindSeer: Real-time volumetric wind prediction over complex terrain aboard a small uncrewed aerial vehicle,” or sUAV. sUAVs can fly farther and more safely if they can reason about the terrain-affected wind in their vicinity. Traditional wind predictions ignore small terrain features and work at the scale of hours and miles, far too coarsely for sUAVs. WindSeer can estimate the terrain-dependent wind field around an sUAV in flight, with limited onboard compute and measurement data, paving the way for safer and more energy-efficient autonomous drone operation.

Learn More:

WindSeer: Real-time volumetric wind prediction over complex terrain aboard a small UAV

Transcript 

[MUSIC] 

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.  

[MUSIC FADES] 

I’m here today with Dr. Andrey Kolobov, a principal research manager at Microsoft Research. Dr. Kolobov is coauthor of a paper called “WindSeer: Real-time volumetric wind prediction over complex terrain aboard a small uncrewed aerial vehicle,” otherwise known as an sUAV. Andrey Kolobov, great to have you on Abstracts!

ANDREY KOLOBOV: Thank you for having me!

HUIZINGA: So let’s start with a sort of abstract of your abstract. In just a few sentences, tell us about the problem your research addresses and more importantly, why we should care about it. 

KOLOBOV: Right, so the overarching goal of this work—and I have to thank my collaborators from ETH Zürich, without whom this work would have been impossible—so the overarching goal of our work was to give drones the ability to stay aloft longer, safer, and cover larger distances. The reason why this is important is because drones’ potential for, for instance, quick delivery of small goods has long been understood, but in practice, their usefulness has been limited by the time they can spend in the air, by how quickly they drain their battery. And lifting these limitations brings the reality of getting the stuff that you order on the internet delivered to you quickly by drones closer. 

HUIZINGA: Is that the core problem, is drone delivery? 

KOLOBOV: Of course, when we were starting this project, we were not interested in any one application. We were interested in implications of AI for drone flight. The limitations of drones’ time aloft ultimately come from drone flight technology, which is very well established, very well understood, and ultimately relies on drones actively fighting forces of nature, such as gravity and wind, and because of this draining their batteries quickly. So within the framework of that technology, it’s difficult to get around these limitations. So what we’re aiming to show is that using AI, drones can reason about their environment in ways that allow them to embrace these forces of nature rather than actively fight them and thereby save a lot on energy and increase their time in the air.

HUIZINGA: Right, so are we conflating drones with sUAVs, as it were, small uncrewed aerial vehicle? 

KOLOBOV: Yes, this work, we are somewhat conflating them, but this work focused specifically on small UAVs, small drones, because these drones’ ability to fight forces of nature is quite limited. Their battery life is way more limited than that of larger drones, and for them, this work is especially important. 

HUIZINGA: OK, and I’m assuming it’s not a new problem and also assuming that you’re not entering a field with no previous research! [LAUGHTER] So what’s been done in this area before, and what gap in the literature or the practice does your research fill? 

KOLOBOV: Yeah, of course. Certainly, many other very, very smart people have thought about this area. What we have tried doing and what we have accomplished differs from previous efforts in how much compute and how little data at inference time our method requires, and also in the fine scale at which it makes its predictions. Obviously, there are weather models that model various aspects of the atmosphere, and they can predict wind, but they do this at the scale of hours, at spatial scales of tens of miles, which is way too crude to be useful for drone flights at low altitudes. And also, these models do this at much higher altitudes, not where drones fly close to the ground, where it’s very important for them to know about wind to potentially avoid collision with terrain, but very high up in the air. The tools that could conceptually solve the same problem we were trying to solve are computational fluid dynamics simulations, so-called CFD simulations. However, they’re very expensive. They cannot run on the drone. And so if you want the drone to be fully autonomous, they’re not really a feasible solution.

HUIZINGA: So how would you describe then how you attacked this problem? What methodology did you use for this work, and how did you go about conducting the research?

KOLOBOV: So one thing that people reading about this work might find funny is this déjà vu feeling of seeing the overarching technical insight that we had in a completely different context, in the context of training models such as Phi, Microsoft’s Phi. The reason why it’s funny is because we were trying to solve an entirely different problem in a project that started in a different research era, the pre-large-model era, and yet we came up with something quite similar. And this overarching technical insight is this: if you want to build a small but powerful model, one way of doing this is to find a powerful but potentially computationally expensive—or expensive in some other way—generative data source, generate data from that source in a very carefully controlled manner, and use this carefully constructed dataset to train your model. This is exactly what we did. In our case, this powerful but expensive generative data source was the computational fluid dynamics simulations, which we used in combination with 3D terrain maps that are publicly available on the internet to generate a lot of high-quality data, throw in a few more tricks, and get the model that we wanted.
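The insight Andrey describes—query an expensive generative source offline to build a carefully constructed dataset, then train a small model on it—can be illustrated with a toy example. The "simulator" below is a made-up linear function, nothing like real CFD; it only stands in for a data source too costly to run onboard.

```python
# Toy distillation-style data generation: expensive source -> dataset -> small model.
import random

def expensive_simulator(x):
    # Pretend this call is slow and can only run offline, not on the drone.
    return 3.0 * x + 1.0

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
dataset = [(x, expensive_simulator(x)) for x in xs]

# "Small model": fit y = a*x + b by ordinary least squares.
n = len(dataset)
sx = sum(x for x, _ in dataset)
sy = sum(y for _, y in dataset)
sxx = sum(x * x for x, _ in dataset)
sxy = sum(x * y for x, y in dataset)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n
print(round(a, 3), round(b, 3))  # recovers the source: 3.0 1.0
```

The cheap fitted model then substitutes for the expensive source at inference time, which is the same trade WindSeer makes between CFD simulations and the onboard network.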

HUIZINGA: Can you talk about the “few more tricks”? [LAUGHS] 

KOLOBOV: [LAUGHS] Well, so we needed to train this model to make predictions based on very little data. Computational fluid dynamics simulations typically need a lot of data at prediction time, the so-called boundary conditions: they essentially need to know the wind at many locations in order to be able to predict it at the location that you’re interested in. And so we had to structure the data generation in a way that allowed us to avoid this limitation.

HUIZINGA: Talk to me a little bit more about the datasets that you used. 

KOLOBOV: Yes, so all the data was synthetically generated. 

HUIZINGA: All of it? 

KOLOBOV: All of it! All of it was generated from computational fluid dynamics simulations. 

HUIZINGA: Um, and was this methodology unique and new, or is it, uh, kind of building on other ways of doing things? 

KOLOBOV: So the idea of using high-quality data sources had been known, under various guises, to various research communities. Some would refer to it as distillation; in the context of these predictive weather models, it would be known as data assimilation. But none of them were doing what we were trying to do, which again is getting a model that will make predictions with very limited compute and a very limited amount of data at inference time.

HUIZINGA: Well, let’s move from research methods to research findings. Give us a quick overview of how things worked out for you and what you found. 

KOLOBOV: So in a nutshell, as trivial as it sounds, the surprising finding was that it works! [LAUGHTER] Again, the reason why it’s surprising is, again, we used only synthetic data to predict something very, very real and something that people have put a lot of thinking into modeling as part of weather models, for instance. And it turned out that using just synthetic data, you can get a small model that, as the drone is flying through the air and as it’s measuring wind at its current location, this model allows you to predict that there is a downdraft 300 feet away from the drone on the other side of the hill. It’s just amazing that something so small can do something so complex and powerful. 

HUIZINGA: Right. Well, let’s drill in there and, kind of, talk about real-world impact here because this is really important for a lot of wind-prediction scenarios. How does this impact real-world scenarios? Who benefits most from the kinds of applications that you might get from this?

KOLOBOV: Yeah, so there is a number of scenarios where it’s valuable to have a drone—usually a fixed-wing drone that, due to its inherent characteristics, can stay in the air longer than a copter drone—where it’s beneficial to have such a drone stay in the air for long periods of time, silently observing something. So the applications range from agriculture to environment conservation, where you want to track the movements, migrations of animals, to security. And of course, the technology that we develop does not have to be applied to fixed-wing drones. It can also be applied to copter drones, which is the drone model that is usually considered for use in drone delivery, and those drones can also benefit from it, especially in city conditions, where presumably they will have to fly around skyscrapers and take into account the effects that the skyscrapers and other buildings and structures have on the wind near terrain. 

HUIZINGA: So one more question on the real-world impact. In your paper, you talked a little bit about wind farming and other places where understanding how wind works and being able to predict it matters. Is that one? Are there others? 

KOLOBOV: It for sure is one area. Again, in this work, we focused mostly on applications of wind prediction that have to do with drones.  

HUIZINGA: OK.  

KOLOBOV: Besides time aloft, one application is safety. In many places around rough terrain, you know, in the mountains, predicting wind, predicting downdrafts and updrafts, has safety implications because drones fly so close to terrain, and the winds, the airflow, can be so strong in some places over such terrain that it can basically drag the drone into the ground no matter what the drone does. It can do it very, very quickly. So again, predicting such phenomena becomes a matter of drone safety. The same applies, or will apply, in city conditions, where drones will be flying among buildings and wind can be so strong that it can carry a drone into a building or into another obstacle.

HUIZINGA: Well, I assume you didn’t solve everything with this paper and that there might still be some open questions remaining in the field! So what are some of the big outstanding challenges people still face here, and what’s next on your research agenda to overcome them? 

KOLOBOV: Of course, this work is, in some sense, just the beginning. This work is about helping drones make sense of the environment around them. But this ability to make sense is not by itself useful without drones being able to use the results of this estimation to plan how to fly in a safer and more energy-efficient way and to adapt their plans as the environment around them changes. So this is the natural next step: have drones take their predictions into account when planning their actions.

HUIZINGA: Well, Andrey Kolobov, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find one on arXiv. You can also read it in Nature Communications, Volume 15, April 25. See you next time on Abstracts!

[MUSIC]

The post Abstracts: May 20, 2024 appeared first on Microsoft Research.

]]>
Abstracts: May 6, 2024 http://approjects.co.za/?big=en-us/research/podcast/abstracts-may-6-2024/ Mon, 06 May 2024 13:00:00 +0000 http://approjects.co.za/?big=en-us/research/?p=1030317 Researcher Michel Galley explores how he and fellow researchers combined new and existing data to create MathVista, an open-source benchmark for measuring the mathematical reasoning capabilities of foundation models in scenarios that involve text and images.

The post Abstracts: May 6, 2024 appeared first on Microsoft Research.

]]>

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Principal Researcher Michel Galley joins host Gretchen Huizinga to discuss “MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts,” which was accepted at the 2024 International Conference on Learning Representations (ICLR). MathVista, an open-source benchmark, combines new and existing data to measure how good models are at solving a variety of math problems that involve processing images as well as text, helping to gain insight into their reasoning capabilities.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

My guest today is Dr. Michel Galley, a senior principal researcher at Microsoft Research. Dr. Galley is the coauthor of a paper called “MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts.” Michel, thanks for joining us on Abstracts today!

MICHEL GALLEY: Thank you for having me.

HUIZINGA: So I like to start with a distillation or sort of an elevator pitch of your research. Tell us in just a couple sentences what problem or issue your paper addresses and why we should care about it.

GALLEY: So this paper is about evaluating large foundation models. It’s a very important part of researching large language models because it’s a good way to evaluate, kind of, the capabilities—what these models are good at and not good at. And part of the focus of MathVista is to evaluate these large foundation models in a multimodal setup, so when the input to the model is actually not just text but text and images. An example of a task that such a model would perform is, like, the input is maybe a mathematical question, and then there’s some visual support to that question, let’s say an image of a graph, and then the model has to respond to something related to that. And why this is important … there has been a lot of work, of course, on large foundation models. Especially when it comes to reasoning tasks, like mathematical reasoning, a lot of it has focused on the written form.

HUIZINGA: Yeah …

GALLEY: So MathVista is one of the very first datasets that has input that is both images and text.

HUIZINGA: Yeah, yeah. Well, reading your paper, it seems like this is an area that hasn’t been studied systematically. In fact, you actually say that! And say that the field is largely unexplored. But quickly tell us what has been done in this field, and then tell us how your research addresses the proverbial gap in the literature.

GALLEY: Well, there has been a lot of work on vision and language in other problems, like not just about reasoning. Maybe let me just mention why reasoning is important. So one reason I think it’s very interesting to evaluate these large language models in terms of reasoning skill is that we evaluate their capabilities beyond just memorization. So as many of your listeners probably know, these large foundation models are trained on large amounts of text that is public data from various sources. So when you ask a question to a large foundation model, it could be the case, in many cases, that it just memorizes things it has seen in the data.

HUIZINGA: Sure.

GALLEY: So what makes it interesting in terms of reasoning is that the answer oftentimes is not there in the data. So the model needs to develop this ability to connect the dots between various pieces of information to come up with a new answer. So the focus of our paper is really on mathematical reasoning, but it goes also a bit beyond that because what is represented in the data also includes science questions and so on.

HUIZINGA: Yeah …

GALLEY: So this reasoning part has largely focused, until MathVista, on text-only modalities.

HUIZINGA: Yeah …

GALLEY: So it’s one of the very first ones that combines text and images in terms of evaluating these large foundation models. So you ask about what was done before. So, yes, there has been a lot of work, text only, on reasoning, for example, mathematical questions that are just based on text. And there has been a different stream of work that was much more focused on vision. A lot of work has been on tasks such as visual question answering …

HUIZINGA: Yeah …

GALLEY: … where basically, you have an image and the task is to answer a question about this image. So, yes, we’re trying to fuse the two lines of research here.

HUIZINGA: Right …

GALLEY: And that’s one of the first works that does that.

HUIZINGA: Yeah. Well, let’s talk about your methodology for a minute. Tell us how you went about conducting this research, and what methods did you use?

GALLEY: Yes, sure. So that’s a bit different from a typical, kind of, machine learning paper because the focus of this work is really on benchmarking, on the dataset. So the methodology is more about how we collect the data and process it. There are two components to doing that. One was to look at existing data that already combines vision and text. And there are existing datasets that are actually already fairly big but that were not focused on reasoning. So we used those existing datasets and looked for instances in the data that actually include some mathematical or science reasoning. And so that part is leveraging existing datasets, but the important part is, like, we really wanted to carve out the interesting pieces in terms of reasoning. And we had different stages of processing the data to identify the subset that was reasoning-based. So one first step was basically to apply some automatic filter to determine whether or not a given example, let’s say something that is visual and text, actually involves some mathematical reasoning. So we have different strategies. For example, if the answer is numerical, it’s likely that it might be something mathematically related. But that’s just the first stage. In the second stage, we actually had humans, annotators, certify that the selected data is actually of high quality: yes, this is mathematical or scientific, and so on. And that’s one part of the effort. The other part is that we realized, while we collected the data, that there are certain types of mathematical reasoning, or related to mathematical reasoning, that were not represented in the data. So we created three new datasets as part of MathVista. So when I say dataset, think of MathVista as an aggregate of different types of data, and we added three of them, three new types of data.
One is what we call PaperQA, which is basically data that is collected from scientific papers on arXiv, with questions asking about the paper that include some visual component from the paper, typically a plot or a figure.

HUIZINGA: Yeah …

GALLEY: And then we had IQTest, which is, I mean, only vaguely related to mathematics, but it also, kind of, tries to test maybe more abstract thinking about input that is both text and visual. And the final one is FunctionQA, which is basically algebraic reasoning about function plots and so on.

HUIZINGA: OK …

GALLEY: The important part was actually to identify among vast amounts of data what is actually very interesting in terms of mathematical reasoning.

HUIZINGA: Yeah …

GALLEY: So that part, I think, was quite a big part of doing that work—finding existing data but also creating new data.
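The first-stage automatic filter Michel describes—keeping examples whose answer looks numerical as a cheap signal of mathematical reasoning, then sending survivors to human annotators—might look like the sketch below. The heuristic and examples are illustrative, not the actual MathVista filter.

```python
# Sketch of a numeric-answer heuristic for first-stage filtering.

def looks_numerical(answer: str) -> bool:
    try:
        float(answer.strip().rstrip("%").replace(",", ""))
        return True
    except ValueError:
        return False

examples = [
    {"question": "What is the slope of the plotted line?", "answer": "2.5"},
    {"question": "What animal is shown in the photo?", "answer": "a zebra"},
    {"question": "What percent of the bar is shaded?", "answer": "40%"},
]
candidates = [ex for ex in examples if looks_numerical(ex["answer"])]
print(len(candidates))  # 2 examples go on to human verification
```

A filter like this is intentionally high-recall and imprecise; the human annotation stage is what certifies that the surviving examples really involve mathematical or scientific reasoning.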

HUIZINGA: Yeah, yeah. Well, my favorite part of a research paper is where it says, “and what we found was … ,” so talk a little bit about your results. What did you find?

GALLEY: So we evaluated a wide variety of models, including GPT-4, Claude 2, GPT-4V, multimodal Bard, and LLaVA, and we categorized them into three categories. One is text only. So, basically, you take a model that is by default just text, and we give it the text part of the question and ask it to answer the question. Of course, that’s a bit of a difficult task because oftentimes [LAUGHTER] we build these questions precisely so that you have to rely on the vision part. But that’s for, you know, scientific investigation, to know how well they can do, and so that’s one category of model. A different category is still text only but is given the text detected from the image. So on the image, we do OCR. We convert those words from images to text. It’s kind of an extension of the text-based model, except that what was images is translated into text, and then the input to the model is words only, and that’s a different category of model. And the third one is a truly multimodal model. And what we found, not surprisingly, is that the one doing most poorly is the one that is text only. The second is text plus OCR. And then finally, the one that does best is the multimodal, like GPT-4V. But while the ordering between these three categories makes sense, it was a bit surprising that the gap between multimodal and text plus OCR was not bigger. Well, it’s big, but maybe not as big as we were expecting. So, for example, the best text-plus-OCR model achieved around 35 percent accuracy while GPT-4V was at 50 percent. So it’s a substantial gap but not huge.
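The three evaluation conditions Michel lists—text only, text plus OCR'd words from the image, and fully multimodal—can be sketched as a small harness. The "models" below are stubs and the examples are made up; only the shape of the comparison mirrors the setup he describes, not the paper's numbers.

```python
# Sketch of the three-condition comparison: text, text+OCR, multimodal.

def evaluate(model, examples):
    correct = sum(model(ex) == ex["answer"] for ex in examples)
    return correct / len(examples)

examples = [
    {"text": "A line passes through the marked points. What is its slope?",
     "ocr": "(0, 0) (1, 2)", "answer": "2"},
    {"text": "From the graph, what is f(3)?",
     "ocr": "f(x)", "answer": "9"},
]

def text_only(ex):          # never sees the image at all
    return "2" if "slope" in ex["text"] else "?"

def text_plus_ocr(ex):      # also sees words extracted from the image
    return "2" if "(1, 2)" in ex["ocr"] else "?"

def multimodal(ex):         # pretend it can actually read the plot
    return ex["answer"]

for name, model in [("text", text_only), ("text+OCR", text_plus_ocr), ("multimodal", multimodal)]:
    print(name, evaluate(model, examples))
```

Running all three conditions on the same examples is what makes the gaps between the categories directly comparable.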

HUIZINGA: Right. Just to clarify, you’re saying OCR. What does that stand for?

GALLEY: [Optical] character recognition.

HUIZINGA: Gotcha.

GALLEY: So, basically, it’s the task of taking text, sometimes typed, sometimes handwritten, and converting it into the actual text like you would have in a text file.

HUIZINGA: Right. Michel, does any of this have to do with the difficulty of the math problems that you present these models with? I mean, it seems to me, similar to humans, that the easier the problem, the easier it would be for the machine. So at what level of math are we talking for these tests?

GALLEY: What’s nice about MathVista is there’s a continuum of different difficulties. So the spectrum is quite broad, going from elementary school to more advanced concepts such as calculus. In the paper, we do have this, kind of, broken down by level. So the number I gave you, like 50 percent, is an aggregate over all the difficulties. But …

HUIZINGA: Gotcha.

GALLEY: But the goal there was really, kind of, to compare different models, but we do have a fair amount of analysis in the appendix. Actually, we have 100 pages of appendices of plenty of analysis and so on. So if people, I mean …

HUIZINGA: I saw that. I saw the length of the paper, and I’m going, what? [LAUGHS] That’s a LONG paper! Well, research in the lab is one thing, I always like to say, but understanding real-world impact is important, too. So where’s this work going to make the most difference, and who does it help most at this point?

GALLEY: Well, I think perhaps the main point of this line of work in terms of reasoning is that looking at these difficult problems that are mathematical is actually a way to, kind of, abstract away maybe more complex capabilities, and I think while thinking just about mathematics might seem a bit narrow, I don’t think that it really is. It’s more about seeing whether this model has the ability to do, kind of, multistep processing of your input and think maybe somewhat intelligently about a given problem. So we focus mostly on math. There is some science, but we would be very interested, especially in future work, to, kind of, go beyond that.

HUIZINGA: OK, well, let me press in a little bit there because … just say I’m a regular person using a GPT model. Is your work more addressed upstream from that to the research community to say, how do we get these models to be better so that downstream people like me can be more confident of the models?

GALLEY: Yes, I would say at the moment, I mean, this line of work is perhaps more geared towards somewhat more research community, but I think it could be some seed for researchers to think about some applications perhaps that also requires some kind of step-by-step reasoning but perhaps not going beyond math.

HUIZINGA: Yeah. Michel, if there was one thing you wanted our listeners to take away from this research, kind of golden nugget, what would it be?

GALLEY: Well, I would say it’s the challenging part of these datasets. I think that’s what makes MathVista stand out compared to other datasets. By now, there are a few other vision-and-language datasets and, of course, many that are more text-based. And we’ve seen, for example, some recent papers showing that MathVista actually remains one of the most challenging ones. So I think it’s probably going to stay around for a while because of the difficulty it represents. And it’s an open-source, available dataset that everybody can use, and I very much encourage people to use it.

HUIZINGA: Is it on GitHub?

GALLEY: Yes, it’s on GitHub.

HUIZINGA: So what’s next on the research agenda for helping LLMs get better at math, Michel? What big challenges remain in the field? I mean, you’ve alluded to many of them already, but what’s next on your research agenda?

GALLEY: Well, I would say what we found so far is that these models are very good at processing the textual part of the problems they’re given, but the equivalent in images is actually harder somehow. So I think a lot more work needs to be done in terms of vision capabilities, in terms of reasoning over images, because the capabilities you see in text are actually quite advanced, whereas the equivalent in images doesn’t seem that good. I mean, a fair disclaimer: my background is more on the text side, [LAUGHTER] and some of my colleagues on the paper are more on the vision side, so if a listener runs into some of our coauthors at the conference, they might want to talk to these vision people because that’s less of my background. [LAUGHS]

HUIZINGA: Well, and if you think about Venn diagrams, you know, you’ve got people that are doing text, people that are doing vision, and then the people that are trying to do both to see how the worlds collide.

[MUSIC]

Well, Michel Galley, thanks for joining us today. And to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find it on arXiv. You can also read it on the website for the International Conference on Learning Representations, or ICLR. And if you happen to be at the ICLR conference this week, you can hear more about it there. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: May 6, 2024 appeared first on Microsoft Research.

]]>
Abstracts: April 16, 2024 http://approjects.co.za/?big=en-us/research/podcast/abstracts-april-16-2024/ Tue, 16 Apr 2024 13:00:00 +0000 http://approjects.co.za/?big=en-us/research/?p=1024143 Tusher Chakraborty talks about the paper “Spectrumize: Spectrum-efficient Satellite Networks for the Internet of Things,” including a method for supporting communication between a large IoT-satellite constellation and devices on Earth within a limited spectrum.

The post Abstracts: April 16, 2024 appeared first on Microsoft Research.

]]>

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Research Software Engineer Tusher Chakraborty joins host Gretchen Huizinga to discuss “Spectrumize: Spectrum-efficient Satellite Networks for the Internet of Things,” which was accepted at the 2024 USENIX Symposium on Networked Systems Design and Implementation (NSDI). In the paper, Chakraborty and his coauthors share their efforts to address the challenges of delivering reliable and affordable IoT connectivity via satellite-based networks. They propose a method for leveraging the motion of small satellites to facilitate efficient communication between a large IoT-satellite constellation and devices on Earth within a limited spectrum.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

I’m talking today to Tusher Chakraborty, a senior research software engineer at Microsoft Research. Tusher is coauthor of a paper called “Spectrumize: Spectrum-efficient Satellite Networks for the Internet of Things.” Tusher, thanks for joining us on Abstracts!

TUSHER CHAKRABORTY: Hi. Thank you for having me here, Gretchen, today. Thank you.

HUIZINGA: So because this show is all about abstracts, in just a few sentences, tell us about the problem your paper addresses and why we should care about it.

CHAKRABORTY: Yeah, so think of, I’m a farmer living in a remote area and bought a sensor to monitor the soil quality of my farm. The big headache for me would be how to connect the sensor so that I can get access to the sensor data from anywhere. We all know that connectivity is a major bottleneck in remote areas. Now, what if, as a farmer, I could just click the power button of the sensor, and it gets connected from anywhere in the world. It’s pretty amazing, right? And that’s what our research is all about. Get your sensor devices connected from anywhere in the world with just the click of power button. We call it one-click connectivity. Now, you might be wondering, what’s the secret sauce? It’s not magic; it’s direct-to-satellite connectivity. So these sensors directly get connected to the satellites overhead from anywhere on Earth. The satellites, which are orbiting around the earth, collect the data from the sensing devices and forward to the ground stations in some other convenient parts of the world where these ground stations are connected to the internet.

HUIZINGA: So, Tusher, tell us what’s been tried before to address these issues and how your approach contributes to the literature and moves the science forward.

CHAKRABORTY: So satellite connectivity is nothing new and has been around for a long time. However, what sets us apart is our focus on democratizing space connectivity, making it affordable for everyone on the planet. We are talking about satellites that are at least 10 to 20 times cheaper and smaller than state-of-the-art satellites. So naturally, this ambitious vision comes with its own set of challenges. When you try to make something cheaper and smaller, you face lots of challenges that the big satellites don’t face. If I go a bit technical, think of the antenna. These big satellites’ antennas can actually focus on a particular part of the world; this is something called beamforming. On the other hand, when we try to make the satellites cheaper and smaller, we can’t have that luxury of beamforming capability. Instead, they have omnidirectional antennas. So you can’t focus on a particular part of the earth; rather, you create a huge footprint all over the earth. This is one of the challenges that the state-of-the-art satellites don’t face. And we try to solve these challenges because we want to make connectivity affordable with cheaper and smaller satellites.
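
The footprint tradeoff Chakraborty describes comes down to simple spherical geometry: without beamforming, a small satellite's coverage is bounded only by its horizon. A rough sketch of that calculation follows; the 550 km altitude and the elevation angles are illustrative assumptions, not figures from the episode.

```python
import math

EARTH_RADIUS_KM = 6371.0

def footprint_radius_km(altitude_km: float, min_elevation_deg: float = 0.0) -> float:
    """Ground-range radius of a satellite's coverage circle.

    With an omnidirectional antenna, coverage is limited only by geometry:
    the Earth-central angle from the sub-satellite point to the edge of
    visibility at the given minimum elevation angle.
    """
    e = math.radians(min_elevation_deg)
    sin_rho = EARTH_RADIUS_KM / (EARTH_RADIUS_KM + altitude_km)
    # Standard link-geometry result: central angle to the coverage edge.
    lam = math.acos(sin_rho * math.cos(e)) - e
    return EARTH_RADIUS_KM * lam

# Illustrative LEO altitude of 550 km (an assumption, not from the paper):
radius = footprint_radius_km(550.0)
area_million_km2 = math.pi * radius**2 / 1e6
print(f"coverage radius ≈ {radius:.0f} km, area ≈ {area_million_km2:.1f}M km²")
```

At zero elevation the coverage radius for a 550 km orbit works out to roughly 2,500 km, an area vastly larger than California, which is why interference management, rather than raw coverage, becomes the central problem.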

HUIZINGA: Right. So as you’re describing this, it sounds like this is a universal problem, and people have obviously tried to make things smaller and more affordable in the past. How is yours different? What methodology did you use to resolve the problems, and how did you conduct the research?

CHAKRABORTY: OK, I’m thrilled that you asked this one because the research methodology was the most exciting part for me here. As a part of this research, we launched a satellite in a joint effort with a satellite company. Like, this is very awesome! So it was a hands-on experience with a real-deal satellite system. It was not simulation-based system. The main goal here was to learn the challenge from a real-world experience and come up with innovative solutions; at the same time, evaluate the solutions in real world. So it was all about learning by doing, and let me tell you, it was quite the ride! [LAUGHTER] We didn’t do anything new when we launched the satellites. We just tried to see how industry today does this. We want to learn from them, hey, what’s the industry practice? We launched a satellite. And then we faced a lot of problems that today’s industry is facing. And from there, we learned, hey, like, you know, this problem is industry facing; let’s go after this, and let’s solve this. And then we tried to come up with the solutions based on those problems. And this was our approach. We didn’t want to assume something beforehand. We want to learn from how industry is going today and help them. Like, hey, these are the problems you are facing, and we are here to help you out.

HUIZINGA: All right, so assuming you learned something and wanted to pass it along, what were your major findings?

CHAKRABORTY: OK, that’s a very good question. So I was talking about the challenges towards this democratization earlier, right? So one of the most pressing challenges: shortage of spectrum. So let me try to explain this from the high level. So we need hundreds of these satellites, hundreds of these small satellites, to provide 24-7 connectivity for millions of devices around the earth. Now, I was talking, the footprint of a satellite on Earth can easily cover a massive area, somewhat similar to the size of California. So now with this large footprint, a satellite can talk with thousands of devices on Earth. You can just imagine, right? And at the same time, a device on Earth can talk with multiple satellites because we are talking about hundreds of these satellites. Now, things get tricky here. [LAUGHTER] We need to make sure that when a device and a satellite are talking, another nearby device or a satellite doesn’t interfere. Otherwise, there will be chaos—no one hearing others properly. So when we were talking about this device and satellite chat, right, so what is that all about? This, all about in terms of communication, is packet exchange. So the device sends some packet to the satellite; satellite sends some packet to the device—it’s all about packet exchange. Now, you can think of, if multiple of these devices are talking with a satellite or multiple satellites are talking with a device, there will be a collision in this packet exchange if you try to send the packets at the same time. And if you do that, then your packet will be collided, and you won’t be able to get any packet on the receiver end. So what we do, we try to send this packet on different frequencies. It’s like a different sound or different tone so that they don’t collide with each other. And, like, now, I said that you need different frequencies, but frequency is naturally limited. And the choice of frequency is even limited. This is very expensive. 
But if you have limited frequency and you want to resolve these collisions, you have a problem. How do you do that? We solve it by smartly exploiting an artifact of these satellites: they are moving really fast around the earth. As they move, they create a unique signature on the frequency they use to talk with the devices on Earth. In physics, this unique signature is known as the Doppler signature. Now you don’t need a separate frequency to make the satellites sound different; you just need to recognize that unique signature to distinguish between satellites and their packets. In that sense, there won’t be any packet collision. And that’s the heart of our findings: multiple devices and satellites can now talk with each other at the same time, on the same frequency, without interference.
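
The "unique signature" here is the Doppler effect: each satellite's orbital motion shifts the frequency it appears to transmit on, and the shift evolves over a pass in a way distinctive to that satellite's trajectory. A back-of-the-envelope sketch of the magnitude involved; the carrier frequency, orbital speed, and pass geometry are illustrative assumptions, and the paper's actual signature-matching method is more sophisticated than this.

```python
import math

SPEED_OF_LIGHT_MS = 299_792_458.0

def doppler_shift_hz(carrier_hz: float, radial_velocity_ms: float) -> float:
    """Narrowband Doppler shift; positive while the satellite approaches."""
    return carrier_hz * radial_velocity_ms / SPEED_OF_LIGHT_MS

# Assumed numbers: a LEO satellite moves at roughly 7.6 km/s, and at a
# 45-degree pass geometry the line-of-sight component is about v*cos(45°).
carrier_hz = 915e6  # assumed ISM-band IoT carrier
v_radial = 7_600.0 * math.cos(math.radians(45))
shift = doppler_shift_hz(carrier_hz, v_radial)
print(f"Doppler shift ≈ {shift / 1e3:.1f} kHz")
```

A shift of tens of kilohertz that drifts and changes sign over a pass is large enough to act as a per-satellite fingerprint: a receiver can attribute packets on the same nominal channel to different satellites by matching each packet's observed frequency offset against each satellite's predicted Doppler curve.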

HUIZINGA: It sounds, like, very similar to a big room filled with a lot of people. Each person has their own voice, but in the mix, you, kind of, lose track of who’s talking and then you want to, kind of, tune in to that specific voice and say, that’s the one I’m listening to.

CHAKRABORTY: Yeah, I think you picked the right metaphor here! That’s exactly the scenario. In a room full of people trying to talk with each other, if everyone is using the same tone, no one can distinguish one person from another.

HUIZINGA: Right …

CHAKRABORTY: Everyone will sound the same, and the voices will collide. So you need a way to differentiate the tones …

HUIZINGA: Yeah …

CHAKRABORTY: … and the satellites’ tones differ because of their fast movement. We use our methodology to recognize which satellite is sending which tone.

HUIZINGA: So you sent up the experimental satellite to figure out what’s happening. Have you since tested it to see if it works?

CHAKRABORTY: Yeah, yeah, so we have tried it out, because this is a software solution, to be honest.

HUIZINGA: Ah.

CHAKRABORTY: As I was talking about, there is no hardware modification required at this point. So what we did, we just implemented this software in the ground stations, and then we tried to recognize which satellite is creating which sort of signature. That’s it!

HUIZINGA: Well, it seems like this research would have some solid real-world impact. So who would you say it helps most and how?

CHAKRABORTY: OK, that’s a very good one. So the majority of the earth still doesn’t have affordable connectivity. The lack of connectivity poses a big challenge to critical industries such as agriculture—the example that I gave—energy, and supply chain, hindering their ability to thrive and innovate. Our vision is clear: to bring 24-7 connectivity to devices anywhere on Earth with just a click of the power button. Moreover, affordability is at the heart of our mission, ensuring that this connectivity is accessible to all. At its core, our effort is geared towards empowering individuals and industries to unlock their full potential in an increasingly connected world.

HUIZINGA: If there was one thing you want our listeners to take away from this research, what would it be?

CHAKRABORTY: OK, if there is one thing I want you to take away from our work, it’s this: connectivity shouldn’t be a luxury; it’s a necessity. Whether you are a farmer in a remote village or a business owner in a city, access to reliable, affordable connectivity can transform your life and empower your endeavors. So our mission is to bring 24-7 connectivity to every corner of the globe with just a click of a button.

HUIZINGA: I like also how you say every corner of the globe, and I’m picturing a square! [LAUGHTER] OK, last question. Tusher, what’s next for research on satellite networks and Internet of Things? What big unanswered questions or unsolved problems remain in the field, and what are you planning to do about it?

CHAKRABORTY: Uh … where do I even begin? [LAUGHTER] There are countless unanswered questions and unsolved problems in this field. But let me highlight one that we talked about here: limited spectrum. As our space network expands, so does our need for spectrum. But what’s the tricky part? Why not just throw more and more spectrum at it? The problem is that the chunk of spectrum that’s perfect for satellite communication is often already in use by terrestrial networks. So a hard research question is how we can make sure that terrestrial and satellite networks coexist in the same spectrum without interfering with each other. It’s a tough nut to crack, but it’s a challenge we are excited to tackle head-on as we continue to push the boundaries of research in this exciting field.

[MUSIC]

HUIZINGA: Tusher Chakraborty, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts. You can also read it on the Networked Systems Design and Implementation, or NSDI, website, and you can hear more about it at the NSDI conference this week. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: April 16, 2024 appeared first on Microsoft Research.

]]>
Abstracts: March 21, 2024 http://approjects.co.za/?big=en-us/research/podcast/abstracts-march-21-2024/ Thu, 21 Mar 2024 13:00:00 +0000 http://approjects.co.za/?big=en-us/research/podcast/abstracts-march-21-2024/ Senior Researcher Chang Liu discusses M-OFDFT, a variation of orbital-free density functional theory (OFDFT) that leverages deep learning to help identify molecular properties in a way that minimizes the tradeoff between accuracy and efficiency.

The post Abstracts: March 21, 2024 appeared first on Microsoft Research.

]]>

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Researcher Chang Liu joins host Gretchen Huizinga to discuss “Overcoming the barrier of orbital-free density functional theory for molecular systems using deep learning.” In the paper, Liu and his coauthors present M-OFDFT, a variation of orbital-free density functional theory (OFDFT). M-OFDFT leverages deep learning to help identify molecular properties in a way that minimizes the tradeoff between accuracy and efficiency, work with the potential to benefit areas such as drug discovery and materials discovery.

Transcript

[MUSIC PLAYS] 

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers. 

[MUSIC FADES] 

Today, I’m talking to Dr. Chang Liu, a senior researcher from Microsoft Research AI4Science. Dr. Liu is coauthor of a paper called “Overcoming the barrier of orbital-free density functional theory for molecular systems using deep learning.” Chang Liu, thanks for joining us on Abstracts

CHANG LIU: Thank you. Thank you for this opportunity to share our work. 

HUIZINGA: So in a few sentences, tell us about the issue or problem your paper addresses and why people should care about this research. 

LIU: Sure. Since this is an AI4Science work, let’s start from this perspective. In science, people always want to understand the properties of matter, such as why some substances can cure disease and why some materials are heavy or conductive. For a very long period of time, these properties could only be studied by observation and experiments, and the outcomes would just look like magic to us. If we can understand the underlying mechanism and calculate these properties on our computers, then we can do the magic ourselves, and that can accelerate industries like medicine development and material discovery. Our work aims to develop a method that handles the most fundamental part of such property calculation with better accuracy and efficiency. If you zoom into the problem, the properties of matter are determined by the properties of the molecules that constitute the matter. For example, the energy of a molecule is an important property. It determines which structure the molecule mostly takes, and the structure indicates whether it can bind to a disease-related biomolecule. You may know that molecules consist of atoms, and atoms consist of nuclei and electrons, so the properties of a molecule are the result of the interaction among the nuclei and electrons in the molecule. The nuclei can be treated as classical particles, but electrons exhibit significant quantum effects. You can imagine it like this: electrons move so fast that they appear like a cloud or mist spreading over space. To calculate the properties of the molecule, you need to first solve the electronic structure—that is, how the electrons spread over this space. This is governed by an equation that is hard to solve. The target of our research is hence to develop a method that solves the electronic structure more accurately and more efficiently so that properties of molecules can be calculated at a higher level of accuracy and efficiency, which leads to better ways to solve industrial problems.

HUIZINGA: Well, most research owes a debt to work that went before but also moves the science forward. So how does your approach build on and/or differ from related research in this field? 

LIU: Yes, there are indeed quite a few methods that can solve the electronic structure, but they show a harsh tradeoff between accuracy and efficiency. Currently, density functional theory, often called DFT, achieves a preferred balance for most cases and is perhaps the most popular choice. But DFT still requires considerable cost for large molecular systems; it has a cubic cost scaling. We hoped to develop a method that scales with a milder cost increase. We noted an alternative type of method called orbital-free DFT, or OFDFT, which has a lower order of cost scaling. But existing OFDFT methods cannot achieve satisfactory accuracy on molecules. So our work leverages deep learning to achieve an accurate OFDFT method. The method achieves the same level of accuracy as conventional DFT; meanwhile, it inherits the cost scaling of OFDFT and hence is more efficient than conventional DFT.

HUIZINGA: OK, so we’re moving acronyms from DFT to OFDFT, and you’ve got an acronym that goes M-OFDFT. What does that stand for? 

LIU: The M represents molecules, since it is especially hard for classical or existing OFDFT to achieve a good accuracy on molecules. So our development tackles that challenge. 

HUIZINGA: Great. And I’m eager to hear about your methodology and your findings. So let’s go there. Tell us a bit about how you conducted this research and what your methodology was. 

LIU: Yeah. Regarding methodology, let me delve a bit into some details. We follow the formulation of OFDFT, which solves the electronic structure by optimizing the electron density, where the optimization objective is to minimize the electronic energy. The challenge in OFDFT is that part of the electronic energy, specifically the kinetic energy, is hard to calculate accurately, especially for molecular systems. Existing computation formulas are based on approximate physical models, but the approximation accuracy is not satisfactory. Our method uses a deep learning model to calculate the kinetic energy. We train the model on labeled data, and thanks to its powerful learning ability, the model can give a more accurate result. This is the general idea, but there are many technical challenges. For example, since the model is used as an optimization objective, it needs to capture the overall landscape of the function. The model cannot recover the landscape if only one labeled data point is provided. For this, we made a theoretical analysis of the data generation method and found a way to generate multiple labeled data points for each molecular structure. Moreover, we can also calculate a gradient label for each data point, which provides the slope information on the landscape. Another challenge is that the kinetic energy has a strong non-local effect, meaning that the model needs to account for the interaction between any pair of spots in space. This incurs a significant cost if you use the conventional way to represent density—that is, a grid. For this challenge, we choose to expand the density function on a set of basis functions and use the expansion coefficients to represent the density. The benefit is that this greatly reduces the representation dimension, which in turn reduces the cost of the non-local calculation. These two examples are also where we differ from other deep learning OFDFT works. There are more technical designs, and you can check them in the paper.
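
The loop Liu describes can be illustrated with a toy example: represent the density by basis-expansion coefficients and minimize an energy model over them by gradient descent, using the same kind of gradient information the paper's gradient labels provide. Here the learned kinetic functional is replaced by a made-up convex quadratic purely for illustration; M-OFDFT itself uses a deep network trained on labeled densities.

```python
import numpy as np

# Stand-in for the learned energy functional: a convex quadratic in the
# density's basis-expansion coefficients c. A and b are invented for this
# sketch; in M-OFDFT a neural network plays this role.
A = np.array([[2.0, 0.3],
              [0.3, 1.5]])
b = np.array([0.5, -1.0])

def energy(c: np.ndarray) -> float:
    """Electronic-energy surrogate to be minimized over coefficients c."""
    return 0.5 * c @ A @ c + b @ c

def energy_grad(c: np.ndarray) -> np.ndarray:
    """Slope of the landscape, the role played by a gradient label."""
    return A @ c + b

# OFDFT-style loop: optimize the density (its coefficients) directly,
# instead of solving for orbitals as conventional DFT does.
c = np.zeros(2)
for _ in range(500):
    c = c - 0.1 * energy_grad(c)

c_exact = np.linalg.solve(A, -b)  # analytic minimizer of the quadratic
print(c, c_exact)
```

The point of the sketch is the shape of the computation, not the model: the cost of each step scales with the number of coefficients rather than with a grid or an orbital set, which is where OFDFT's milder cost scaling comes from.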

HUIZINGA: So talk about your findings. After you completed and analyzed what you did, what were your major takeaways or findings? 

LIU: Yeah, let’s dive into the details, into the empirical findings. We find that our deep learning OFDFT, abbreviated as M-OFDFT, is much more accurate than existing OFDFT methods, with tens to hundreds of times lower error, and achieves the same level of accuracy as conventional DFT.

HUIZINGA: Wow … 

LIU: On the other hand, the speed is indeed improved over conventional DFT. For example, on a protein molecule with more than 700 atoms, our method achieves nearly 30 times speedup. The empirical cost scaling is lower than quadratic and is one order less than that of conventional DFT. So the speed advantage would be more significant on larger molecules. I’d also like to mention an interesting observation. Since our method is based on deep learning, a natural question is, how accurate would the method be if applied to much larger molecules than those used for training the deep learning model? This is the generalization challenge and is one of the major challenges of deep learning methods for molecular science applications. We investigated this question and found that the error increases more slowly than linearly with molecular size. Although this is not perfect, since the error is still increasing, it is better than using the same model to predict the property directly, which shows an error that increases faster than linearly. This shows one benefit of leveraging the OFDFT framework when using a deep learning method to solve molecular tasks.

HUIZINGA: Well, let’s talk about real-world impact for a second. You’ve got this research going on in the lab, so to speak. How does it impact real-life situations? Who does this work help the most and how? 

LIU: Since our method achieves the same level of accuracy as conventional DFT but runs faster, it could accelerate molecular property calculation and molecular dynamics simulation, especially for large molecules; hence, it has the potential to accelerate solving problems such as medicine development and material discovery. Our method also shows that AI techniques can create new opportunities for other electronic structure formulations, which could inspire more methods to break the long-standing tradeoff between accuracy and efficiency in this field.

HUIZINGA: So if there was one thing you wanted our listeners to take away, just one little nugget from your research, what would that be? 

LIU: If only one thing, it would be that we developed a method that computes molecular properties more accurately and efficiently than the current portfolio of available methods.

HUIZINGA: So finally, Chang, what are the big unanswered questions and unsolved problems that remain in this field, and what’s next on your research agenda? 

LIU: Yeah, sure. There indeed remain problems and challenges. One remaining challenge mentioned above is generalization to molecules much larger than those in training. Although the OFDFT method is better than directly predicting properties, there is still room to improve. One possibility is to follow the success of large language models by including more abundant and more diverse data in training and using a large model to digest all the data. This can be costly, but it may give us a surprise. Another way we may consider is to incorporate mathematical structures of the learning target functional into the model, such as convexity, lower and upper bounds, and some invariances. Such structures could regularize the model when it is applied to larger systems than it has seen during training. We have actually incorporated some such structures into the model, for example, geometric invariance, but other mathematical properties are nontrivial to incorporate. We discuss some of them in the paper, and we’ll keep working in that direction in the future. The ultimate goal underlying this technical development is to build a computational method that is fast and accurate universally so that we can simulate the molecular world of any kind.

[MUSIC PLAYS] 

HUIZINGA: Well, Chang Liu, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts. You can also read it on arXiv, or you can check out the March 2024 issue of Nature Computational Science. See you next time on Abstracts

[MUSIC FADES]

The post Abstracts: March 21, 2024 appeared first on Microsoft Research.

]]>
Abstracts: February 29, 2024 http://approjects.co.za/?big=en-us/research/podcast/abstracts-february-29-2024/ Thu, 29 Feb 2024 14:00:00 +0000 http://approjects.co.za/?big=en-us/research/?p=1009941 Can how we think about our thinking help us better incorporate generative AI in our lives & work? Explore metacognition’s potential to improve the tech’s usability on “Abstracts,” then sign up for Microsoft Research Forum for more on this & other AI work.

The post Abstracts: February 29, 2024 appeared first on Microsoft Research.

]]>

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Behavioral Science Researcher Lev Tankelevitch joins host Gretchen Huizinga to discuss “The Metacognitive Demands and Opportunities of Generative AI.” In their paper, Tankelevitch and his coauthors propose using the scientific study of how people monitor, understand, and adapt their thinking to address common challenges of incorporating generative AI into life and work—from crafting effective prompts to determining the value of AI-generated outputs.  

To learn more about the paper and related topics, register for Microsoft Research Forum, a series of panel discussions and lightning talks around science and technology research in the era of general AI.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.  

[MUSIC FADES] 

Today, I’m talking to Dr. Lev Tankelevitch, a senior behavioral science researcher from Microsoft Research. Dr. Tankelevitch is coauthor of a paper called “The Metacognitive Demands and Opportunities of Generative AI,” and you can read this paper now on arXiv. Lev, thanks for joining us on Abstracts

LEV TANKELEVITCH: Thanks for having me. 

HUIZINGA: So in just a couple sentences—a metacognitive elevator pitch, if you will—tell us about the issue or problem your paper addresses and, more importantly, why we should care about it. 

TANKELEVITCH: Sure. So as generative AI has, sort of, rolled out over the last year or two, we’ve seen some user studies come out, and as we read these studies, we noticed there are a lot of challenges that people face with these tools. So people really struggle with, you know, writing prompts for systems like Copilot or ChatGPT. For example, they don’t even know really where to start, or they don’t know how to convert an idea they have in their head into, like, clear instructions for these systems. If they’re, sort of, working in a field that maybe they’re less familiar with, like a new programming language, and they get an output from these systems, they’re not really sure if it’s right or not. And then, sort of, more broadly, they don’t really know how to fit these systems into their workflows. And so we’ve noticed all these challenges, sort of, arise, and some of them relate to, sort of, the unique features of generative AI, and some relate to the design of these systems. But basically, we started to, sort of, look at these challenges, and try to understand what’s going on—how can we make sense of them in a more coherent way and actually build systems that really augment people and their capabilities rather than, sort of, posing these challenges? 

HUIZINGA: Right. So let’s talk a little bit about the related research that you’re building on here and what unique insights or directions your paper adds to the literature. 

TANKELEVITCH: So as I mentioned, we were reading all these different user studies that were, sort of, testing different prototypes or existing systems like ChatGPT or GitHub Copilot, and we noticed different patterns emerging, and we noticed that the same kinds of challenges were cropping up. But there weren’t any, sort of, clear coherent explanations that tied all these things together. And in general, I’d say that human-computer interaction research, which is where a lot of these papers are coming out from, it’s really about building prototypes, testing them quickly, exploring things in an open-ended way. And so we thought that there was an opportunity to step back and to try to see how we can understand these patterns from a more theory-driven perspective. And so, with that in mind, one perspective that became clearly relevant to this problem is that of metacognition, which is this idea of “thinking about thinking” or how we, sort of, monitor our cognition or our thinking and then control our cognition and thinking. And so we thought there was really an opportunity here to take this set of theories and research findings from psychology and cognitive science on metacognition and see how they can apply to understanding these usability challenges of generative AI systems. 

HUIZINGA: Yeah. Well, this paper isn’t a traditional report on empirical research as many of the papers on this podcast are. So how would you characterize the approach you chose and why?

TANKELEVITCH: So the way that we got into this, working on this project, it was, it was quite organic. So we were looking at these user studies, and we noticed these challenges emerging, and we really tried to figure out how we can make sense of them. And so it occurred to us that metacognition is really quite relevant. And so what we did was we then dove into the metacognition research from psychology and cognitive science to really understand what are the latest theories, what are the latest research findings, how could we understand what’s known about that from that perspective, from that, sort of, fundamental research, and then go back to the user studies that we saw in human-computer interaction and see how those ideas can apply there. And so we did this, sort of, in an iterative way until we realized that we really have something to work with here. We can really apply a somewhat coherent framework onto these, sort of, disparate set of findings not only to understand these usability challenges but then also to actually propose directions for new design and research explorations to build better systems that support people’s metacognition. 

HUIZINGA: So, Lev, given the purpose of your paper, what are the major takeaways for your readers, and how did you present them in the paper? 

TANKELEVITCH: So I think the key, sort of, fundamental point is that the perspective of metacognition is really valuable for understanding the usability challenges of generative AI and potentially designing new systems that support metacognition. And so one analogy that we thought was really useful here is of a manager delegating tasks to a team. And so a manager has to determine, you know, what is their goal in their work? What are the different subgoals that that goal breaks down into? How can you communicate those goals clearly to a team, right? Then how do you assess your team’s outputs? And then how do you actually adjust your strategy accordingly as the team works in an iterative fashion? And then at a higher level, you have to really know how to—actually what to delegate to your team and how you might want to delegate that. And so we realized that working with generative AI really parallels these different aspects of what a manager does, right. So when people have to write a prompt initially, they really have to have self-awareness of their task goals. What are you actually trying to achieve? How does that translate into different subtasks? And how do you verbalize that to a system in a way that system understands? You might then get an output and you need to iterate on that output. So then you need to really think about, what is your level of confidence in your prompting ability? So is your prompting the main reason why the output isn’t maybe as satisfactory as you want, or is it something to do with the system? Then you actually might get the output [you’re] happy with, but you’re not really sure if you should fully rely on it because maybe it’s an area that is outside of your domain of expertise. And so then you need to maintain an appropriate level of confidence, right? Either to verify that output further or decide not to rely on it, for example. And then at a, sort of, broader level, this is about the question of task delegation. 
So this requires having self-awareness of the applicability of generative AI to your workflows and maintaining an appropriate level of confidence in completing tasks manually or relying on generative AI. For example, whether it’s worth it for you to actually learn how to work with generative AI more effectively. And then finally, it requires, sort of, metacognitive flexibility to adapt your workflows as you work with these tools. So are there some tasks where the way that you’re working with them is, sort of, slowing you down in specific ways? So being able to recognize that and then change your strategies as necessary really requires metacognitive flexibility. So that was, sort of, one key half of our findings.  

And then beyond that, we really thought about how we can use this perspective of metacognition to design better systems. And so one, sort of, general direction is really about supporting people’s metacognition. So we know from research in cognitive science and psychology that we can actually design interventions to improve people’s metacognition in a lasting and effective way. And so similarly, we can design systems that support people’s metacognition. For example, systems that support people in planning their tasks as they actually craft prompts. We can support people in actually reflecting on their confidence in their prompting ability or in assessing the output that they see. And so this relates a little bit to AI acting as a coach for you, which is an idea that the Microsoft Research New York City team came up with. So this is Jake Hofman, David Rothschild, and Dan Goldstein. And so, in this way, generative AI systems can really help you reflect as a coach and understand whether you have the right level of confidence in assessing output or crafting prompts and so on. And then similarly, at a higher level, they can help you manage your workflows, so helping you reflect on whether generative AI is really working for you in certain tasks or whether you can adapt your strategy in certain ways. And likewise, this relates also to explanations about AI: how can you actually design systems that are explainable to users in a way that helps them achieve their goals? And explainability can be thought of as a way to actually reduce the metacognitive demand, because you’re explaining things in a way that people don’t have to keep in their mind and think about. It can help them improve or calibrate their confidence in their ability to assess outputs.

HUIZINGA: Talk for a minute about real-world impact of this research. And by that, I mean, who does it help most and how? Who’s your main audience for this right now?

TANKELEVITCH: In a sense, this is very broadly applicable. It’s really about designing systems that people can interact with in any domain and in any context. But I think, given how generative AI has rolled out in the world today, I mean, a lot of the focus has been on productivity and workflows. And so this is a really well-defined, clear area where there is an opportunity to actually help people achieve more and stay in control and actually be more intentional and be more aligned with their goals. And so this is an approach where not only can we go beyond, sort of, automating specific tasks but actually use these systems to help people clarify their goals and stay on track with them in a more effective way. And so knowledge workers are an obvious, sort of, use case or an obvious area where this is really relevant because they work in a complex system where a lot of the work is, sort of, diffused and spread across collaborations and artifacts and software and different ways of working. And so a lot of things are, sort of, lost or made difficult by that complexity. And so systems that are flexible and help people actually reflect on what they want to achieve can really have a big impact here.

HUIZINGA: Mm-hmm. Are you a little bit upstream of that even now, in the sense that this is a “research direction” kind of paper? As I read it, I felt like this was about how researchers can begin to think about what they’re doing and how that will help downstream from that.

TANKELEVITCH: Yes. That’s exactly right. So this is really about, we hope, unlocking a new direction of research and design where we take this perspective of metacognition—of how we can help people think more clearly and, sort of, monitor and control their own cognition—and design systems to help them do that. And in the paper, there’s a whole list of different questions, both fundamental research questions to understand in more depth how metacognition plays a role in human-AI interaction when people work with generative AI systems but also how we can then actually design new interventions or new systems that actually support people’s metacognition. And so there’s a lot of work to do in this, and we hope that, sort of, inspires a lot of further research, and we’re certainly planning to do a lot more follow-up research. 

HUIZINGA: Yeah. So I always ask, if there was just one thing that you wanted our listeners to take away from this work, a sort of golden nugget, what would it be? 

TANKELEVITCH: I mean, I’d say that if we really want generative AI to be about augmenting human agency, then I think we need to focus on understanding how people think and behave in their real-world context and design for that. And so I think specifically, the real potential of generative AI here, as I was saying, is not just to automate a bunch of tasks but really to help people clarify their intentions and goals and act in line with them. And so, in a way, it’s kind of about building tools for thought, which was the real vision of the early pioneers of computing. And so I hope that this, kind of, goes back to that original idea.

HUIZINGA: You mentioned this short list of open research questions in the field, along with a list of suggested interventions. You’ve, sort of, curated that for your readers at the end of the paper. But give our audience a little overview of that and how those questions inform your own research agenda coming up next. 

TANKELEVITCH: Sure. So on the, sort of, fundamental research side of things, there are a lot of questions around how, for example, self-confidence that people have plays a role in their interactions with generative AI systems. So this could be self-confidence in their ability to prompt these systems. And so that is one interesting research question. What is the role of confidence and calibrating one’s confidence in prompting? And then similarly, on the, sort of, output evaluation side, when you get an output from generative AI, how do you calibrate your confidence in assessing that output, especially if it’s in an area that you’re maybe less familiar with? And so there are these interesting, nuanced questions around self-confidence, and we’re actually exploring this in a new study. This is part of the AI, Cognition, and [the] Economy pilot project. So this is a collaboration that we’re running with Dr. Clara Colombatto, who’s a researcher at the University of Waterloo and University College London, and we’re essentially designing a study where we’re trying to understand people’s confidence in themselves, in their planning ability, and in working with AI systems to do planning together, and how that influences their reliance on the output of generative AI systems.

[MUSIC PLAYS] 

HUIZINGA: Well, Lev Tankelevitch, thank you for joining us today, and to our listeners, thanks for tuning in. If you want to read the full paper on metacognition and generative AI, you can find a link at aka.ms/abstracts, or you can read it on arXiv. Also, Lev will be speaking about this work at the upcoming Microsoft Research Forum, and you can register for this series of events at researchforum.microsoft.com. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: February 29, 2024 appeared first on Microsoft Research.

Abstracts: January 25, 2024 http://approjects.co.za/?big=en-us/research/podcast/abstracts-january-25-2024/ Thu, 25 Jan 2024 14:00:00 +0000 http://approjects.co.za/?big=en-us/research/?p=1000743 On “Abstracts,” Jordan Ash & Dipendra Misra discuss the parameter reduction method LASER. Tune in to learn how selective removal of stored data alone can boost LLM performance, then sign up for Microsoft Research Forum for more on LASER & related topics.

The post Abstracts: January 25, 2024 appeared first on Microsoft Research.


Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Researchers Jordan Ash and Dipendra Misra join host Gretchen Huizinga to discuss “The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction,” which was accepted to the 2024 International Conference on Learning Representations (ICLR). Layer-Selective Rank reduction, or LASER, is an intervention for targeted parameter reduction in transformer-based models. The work shows that the removal of certain parameters not only maintains model performance like some existing parameter-reduction methods but can actually improve it—no additional training necessary.

To learn more about the paper and related topics, register for Microsoft Research Forum, a series of panel discussions and lightning talks around science and technology research in the era of general AI.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today, I’m talking to Dr. Dipendra Misra and Dr. Jordan Ash, both senior researchers at Microsoft Research. Drs. Misra and Ash are coauthors of a paper called “The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction,” also known as LASER. This paper has been accepted at the International Conference on Learning Representations, or ICLR, in Vienna this year, and you can read a preprint of it now on arXiv. Dipendra, Jordan, thanks for joining us on Abstracts!

JORDAN ASH: Thanks for having us.

DIPENDRA MISRA: Yeah, thanks for having us, Gretchen.

HUIZINGA: Dipendra, let’s start with a general overview of this paper. In a few sentences, describe the issue or problem your work addresses and, perhaps more importantly, why we should care about it.

MISRA: Thanks, Gretchen. So as we know, large language models, also known as LLMs, have revolutionized both business and research in artificial intelligence. They are everywhere, being used to solve a wide range of problems. So in our paper, we introduce an intervention which can be applied to any existing pretrained large language models, and our main purpose for introducing this is to see how it affects the performance of the LLMs and whether we can gain insight into how an LLM stores information in its parameters and how it uses that information to generate a response. And what our intervention does is that it performs a low-rank approximation of the parameters of the LLM. And the surprising discovery that our paper makes is that if you do this intervention correctly, then we can get significant improvement on various tasks for different LLMs.
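For readers who want a concrete picture of the intervention Misra describes, a low-rank approximation of a weight matrix can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper’s implementation: the random matrix stands in for a single weight matrix of an LLM, and all names here are ours.

```python
import numpy as np

def low_rank_approx(W, rank):
    """Best rank-`rank` approximation of W (in the least-squares sense), via SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))      # stand-in for one weight matrix of an LLM
W_laser = low_rank_approx(W, rank=4)   # keep only a small fraction of the rank
```

Applied as described in the interview, one would replace a specific weight matrix at a chosen transformer layer with such an approximation and then re-evaluate the model on the task of interest.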

HUIZINGA: So that’s the first part of the question. Tell me why I should care about it!

MISRA: So if you are a person who uses LLMs for solving any tasks, then you do care about performance on a given task. So, for example, you could be using LLMs to generate an email, right, from a given description. Or you could be using an LLM to do question answering. And by applying our intervention, we can gain accuracy on the task that we care about.

HUIZINGA: Well, let’s stick with you, Dipendra, for a minute and talk about the field writ large. Almost all research owes a debt to some other research that went before. So tell us a bit about the related work in this field and how your work builds on or adds to it.

MISRA: So the work that is most closely related to our LASER paper is this growing body of work on understanding how knowledge is stored and edited inside a large language model. So these works don’t apply the intervention that we do, but they were certainly inspirational for us for arriving at the intervention that we introduced. Another line of work which is very related is, like, adding a small number of parameters to improve the performance of the LLM on a given task. The most relevant work in this space is the LoRA paper, also known as “Low-Rank Adaptation of Large Language Models,” which came from Microsoft. And what LoRA does is add a small number of additional parameters to an LLM and then fine-tune it on a given task. And what our intervention, called LASER, does is remove parameters instead of adding them. And another line of work which is also related is the work on model compression. So there are people who focus on bringing down the size of the models as much as possible while still retaining the performance, more or less, compared to the base model. And so these people are also focused on removing parameters, but they are coming at it from a different angle of, like, trying to reduce the memory footprint, while we are less focused on the memory footprint—that’s more like a side effect of it—and more on questions like, if I were to fiddle with this parameter of the LLM, then how does it affect the performance? And what can we learn by looking at the comparison? Like, OK, so if I remove this parameter and I see the performance drop, then it means that these parameters are storing something about the type of task on which the performance is dropping.
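To make the contrast Misra draws concrete, here is a toy NumPy sketch based only on the descriptions above; the shapes and scaling are illustrative assumptions, not those of any real model. LoRA adds a trainable low-rank update to a frozen weight matrix, while LASER replaces a weight matrix with a low-rank approximation of itself, with no training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4
W = rng.standard_normal((d, d))           # frozen pretrained weight matrix

# LoRA-style: ADD a small trainable low-rank update A @ B to W.
A = rng.standard_normal((d, r)) * 0.01
B = rng.standard_normal((r, d)) * 0.01
W_lora = W + A @ B                        # adds at most r to the rank of W

# LASER-style: REPLACE W with its own rank-r approximation (no new parameters).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_laser = (U[:, :r] * s[:r]) @ Vt[:r, :]  # keeps only r directions of W itself
```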

HUIZINGA: So I’ll ask you one more question, Dipendra, before I pull Jordan into the conversation, and that would be about your methodology. How would you describe your approach to this project, and how did you conduct the research?

MISRA: So we started by analyzing the intervention LASER on a particular LLM called GPT-J and evaluating its performance on the question-answering dataset CounterFact. So our idea was, like, before trying this thing on [a] bunch of things, let’s just understand this in one setting deeply and, kind of, build insights that we can then evaluate in other settings. And the reason we chose this setup was that the GPT-J large language model has its training data publicly available. It’s called the Pile dataset. And that allows us to do analysis with the training data. For example, is the performance dropping on data points which are rarer or more frequent in the training data? And this is important because training data analysis is frequently omitted in existing LLM literature, and that’s something we wanted to do. And the second reason is that the CounterFact question-answering dataset is both related to the prior work in this space, so there was a reason for choosing it, but also it has paraphrases of the same question. For example, it might ask, like, “Who is the president of the United States of America?” But it will also have paraphrases like “The president of the United States of America is …” or “The head of the government of the United States of America is …” And so it will have different variations of the same question. And then you can see if the LLM is able to get all of them right, or is it not robust to variations of the same question? And so we did analysis on this GPT-J and CounterFact dataset. And Jordan will talk more about what the results were. And so based on this rigorous analysis, we developed some insights as to what the intervention is doing. And then we evaluated these insights on other settings. So then we tried, like, two other different large language models and evaluated it on, like, multiple different datasets. And then we saw that the insights actually hold more broadly. And finally, we also evaluated this in a non-text-related task, right.
Because the intervention could, in principle, be applied to any neural network. So we went after this reinforcement learning model, which solves a puzzle called Sokoban. And we also saw that if you apply this intervention correctly, then you can get some performance improvement. So it’s not related to just large language models, although that was our main motivation.

HUIZINGA: Well, Jordan, let’s get your take on the last few questions here. As I’ve said before, the most interesting section of a research paper for me is the part where it says, “and what we found was …” So as a result of this research, what did you find? Were there outcomes that you expected, or were there any surprises?

ASH: I would say this paper is full of surprises. So as Dipendra was mentioning earlier, the LASER intervention removes information from a model. It doesn’t add information to a model. And up until now, there’s been a lot of work on pruning model parameters for a variety of reasons. But generally, these papers show that as parameters are removed from the model, performance just does not degrade. You can, overall, keep performance roughly the same even with a fairly drastic reduction of model parameters. And those reductions are typically done across layers of the model. What we’re showing here is surprising because we’re showing if we do a very targeted intervention, maybe at only one layer of the model, we could actually get a big boost in performance rather than just, you know, keep it the same or something like this.

HUIZINGA: Hmm. So with those results in mind, Jordan, I’m curious about practical applications. How would you say this research makes an impact in real-world situations? I know that Dipendra alluded to that earlier, but where is this most useful and who benefits most?

ASH: I think the short sales pitch for this technique is that you could potentially improve the performance of a language model with no additional training at all just by applying this intervention, which again just removes information from the model, so you don’t need to have any extra data on hand to refine the model or to add new information into it. The real-world situations where we’re seeing a boost right now with LASER are, like, question answering or reasoning-type tasks where there’s, like, a concrete answer that corresponds to what you’re asking the LLM rather than just a, sort of, broad-purpose generative task.

HUIZINGA: So typically speaking, when you’re dealing with LLMs, part of the issue is prompt engineering. And it’s like my responsibility to be able to put the right words in it so I’ll get the best answer from the model, right? Are you saying that this helps me not have to be that good on the prompt-engineer end versus what the model can interpret and do?

ASH: I think prompt engineering still has a place in, sort of, eking out a good answer from a language model, but given a fixed prompt, this intervention seems to offer an improved accuracy over not intervening at all and applying the same prompt.

HUIZINGA: So, Jordan, I often think of an abstract as a sort of appetizer for a research paper. But let’s distill it even further. If there was one thing—sort of an amuse-bouche, if you will—that you want our listeners to take away from this work, what would it be?

ASH: For me, I like this idea of how, you know, typically if you want to get a model to perform better, you would take that model off the shelf and you would refine it on data related to the task at hand. And that might take the form of refining all of the parameters or doing some low-rank LoRA-type thing that Dipendra alluded to earlier. Here, we counterintuitively show that sometimes just carefully removing information from the model can have a positive effect, as well. And this is great news because refining a model requires a lot of new target domain data to be available, but removing information from the model doesn’t necessarily have that same constraint.

HUIZINGA: Well, finally, let’s talk a little bit about the future, Jordan, and I’ll have you close the show for us. What unanswered questions or ongoing research challenges do you see here, and what’s next maybe on your research agenda?

ASH: Yeah, I think there’s a lot of exciting future work for this project. I think for one, as a practical matter, there’s this question of just what’s the best way to find the best LASER intervention? LASER targets a specific layer of the model, and then it finds the extent by which it should be rank-reduced. That search procedure is, kind of, expensive. Right now, we’re doing it in a, sort of, exhaustive way. But also, it seems to be beneficial to apply LASER at multiple layers of the model. And that makes the search procedure, sort of, combinatorially explode. So finding out the best way to compose these interventions, I think, is an important area of future research. And then just, sort of, less on the practical side, I think there are all these questions related to just, why does this work at all? Like, why is it helpful to remove information from the model? And, you know, I think there are some rough ideas we have about this. For example, when you’re training a model on lots and lots of data, you know, it’s not all created equally. Some of it might be noisy or low quality, and some of it might be high quality. And maybe it’s better to remove those samples at training time to get a better model. So I guess there’s this question of, is pruning the model using a LASER-type intervention roughly equivalent to pruning the training data in a way to make it more favorable for eliciting a high-quality model? And again, like Dipendra alluded to earlier, this LoRA procedure, which does something that very much complements LASER and is often used to add information to a model, is it possible that LoRA is actually not just adding information but also removing information from the model? And perhaps that’s one reason why LASER seems to be so effective.
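The exhaustive search Ash describes might look something like the following sketch. Everything here is hypothetical: `evaluate` stands for any function that scores a set of weights on held-out data, the layer names are made up, and this toy only searches single-layer interventions.

```python
import numpy as np

def low_rank(W, rank):
    """Best rank-`rank` approximation of W via SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

def search_laser(weights, evaluate, fractions=(0.5, 0.25, 0.1, 0.02)):
    """Try a rank reduction at each layer in turn; keep the single
    (layer, fraction) pair that scores best on held-out data."""
    best_score, best_choice = evaluate(weights), None   # baseline: no intervention
    for name, W in weights.items():
        for frac in fractions:
            rank = max(1, int(frac * min(W.shape)))
            trial = {**weights, name: low_rank(W, rank)}
            score = evaluate(trial)
            if score > best_score:
                best_score, best_choice = score, (name, frac)
    return best_choice, best_score
```

Applying reductions at several layers at once, as Ash notes, would turn these nested loops into a combinatorial search over subsets of layers, which is where the cost explodes.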

HUIZINGA: So lots of questions.

ASH: I would say so, yeah!

HUIZINGA: Well, Dipendra Misra, Jordan Ash, thanks for joining us today. And to our listeners, thanks for tuning in.

[MUSIC PLAYS]

Again, you can find a link to this paper at aka.ms/abstracts or on arXiv. And I’ll also add that Dipendra will be speaking about this work at the upcoming Microsoft Research Forum, and you can register for this series of events at researchforum.microsoft.com. See you next time on Abstracts!

[MUSIC FADES]

Abstracts: December 12, 2023 http://approjects.co.za/?big=en-us/research/podcast/abstracts-december-12-2023/ Tue, 12 Dec 2023 22:00:00 +0000 http://approjects.co.za/?big=en-us/research/podcast/abstracts-december-12-2023/ Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.  In this episode, Senior Principal Research Manager Tao Qin and Senior Researcher Lijun Wu discuss “FABind: Fast and Accurate Protein-Ligand Binding.” The […]

The post Abstracts: December 12, 2023 appeared first on Microsoft Research.


Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements. 

In this episode, Senior Principal Research Manager Tao Qin and Senior Researcher Lijun Wu discuss “FABind: Fast and Accurate Protein-Ligand Binding.” The paper, accepted at the 2023 Conference on Neural Information Processing Systems (NeurIPS), introduces a new method for predicting the binding structures of proteins and ligands during drug development. The method demonstrates improved speed and accuracy over current methods.

Transcript

[MUSIC PLAYS]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

Today, I’m talking to Dr. Tao Qin, a Senior Principal Research Manager, and Dr. Lijun Wu, a Senior Researcher, both from Microsoft Research. Drs. Qin and Wu are coauthors of a paper titled “FABind: Fast and Accurate Protein-Ligand Binding,” and this paper—which was accepted for the 2023 Conference on Neural Information Processing Systems, or NeurIPS—is available now on arXiv. Tao Qin, Lijun Wu, thanks for joining us on Abstracts!

LIJUN WU: Thanks. 

TAO QIN: Yeah, thank you. Yeah, it’s great to be here and to share our latest research. 

HUIZINGA: So, Tao, let’s start off with you. In a couple sentences, tell us what issue or problem your research addresses and, more importantly, why people should care about it.

QIN: Yeah, uh, we work on the problem of molecular docking, a computational modeling method used to predict the preferred orientation of one molecule when it binds to a second molecule to form a stable complex. So it aims to predict the binding pose of a ligand in the active site of a receptor and estimate the ligand-receptor binding affinity. This problem is very important for drug discovery and development. Accurately predicting binding poses can provide insights into how a drug candidate might bind to its biological target and whether it is likely to have the desired therapeutic effect. To make an analogy, just like a lock and a key, the protein target is the lock, while the ligand is the key. We should carefully design the structure of the key so that it can perfectly fit into the lock. Similarly, the molecular structure should be accurately constructed so that it binds the protein well. Then the protein function would be activated or inhibited. Molecular docking is used intensively in the early stages of drug design and discovery to screen a large library of hundreds of thousands of compounds to identify promising lead compounds. It helps eliminate poor candidates and focus experimental resources on those most likely to bind the target protein well. So clearly, improving the accuracy and also the speed of docking methods, like what we have done in this work, could accelerate the development of new life-saving drugs.

HUIZINGA: So, Lijun, tell us how your approach builds on and/or differs from what’s been done previously in this field. 

WU: Sure, thanks, yeah. So conventional protein-ligand docking methods usually take a sampling-and-scoring approach. That means they will first use some sampling methods to generate multiple protein-ligand docking poses as candidates, and then they will use some scoring functions to evaluate these candidates and choose the best ones. One example is DiffDock, a very recent work from MIT, which is a very strong model that uses a diffusion algorithm to do the sampling in this kind of way. These sampling-and-scoring methods are accurate with good predictions, but of course, they are very slow. This is a very big limitation because the sampling process usually takes a lot of time. Some other methods, such as EquiBind or TANKBind, treat the docking prediction as a regression task, which is to use deep networks to directly predict the coordinates of the atoms in the molecule. Obviously, this kind of method is much faster than the sampling methods, but the prediction accuracy is usually worse. Therefore, our FABind aims to provide a method for the docking problem that is both fast and accurate. FABind keeps its prediction fast by modeling in a regression way, and we also utilize some novel designs to improve its prediction accuracy.

HUIZINGA: So, Lijun, let's stay with you for a minute. Regarding your research strategy on this, how would you describe your methodology, and how did you go about conducting this research? 

WU: OK, sure. When we talk about the detailed method, we actually build an end-to-end deep learning framework, FABind. For protein-ligand docking, FABind divides the docking task into a pocket prediction process and a pose prediction process, but importantly, we unify these two processes within a single deep learning model, which is a novel equivariant graph neural network. Here, the pocket means a local part of the whole protein: some specific amino acids that can bind to the molecule in structure space. Simply speaking, this novel graph neural network is a stack of identical graph layers. Each layer is carefully designed by us, and we use the first graph layer for pocket prediction and the later layers for pose prediction. Within each layer, there are several message-passing operations we designed. The first is independent message passing, which updates the information within the protein and within the molecule themselves. The second is cross-attention message passing, which updates the information between the whole protein and the whole molecule, so that each gets a global view of the other. And the last is interfacial message passing, which passes messages between protein and molecule nodes that are spatially close to each other. Besides these, there are some smaller points that help us get an accurate docking model. For example, we use a scheduled training technique to bridge the gap between the training and inference stages, and we combine direct coordinate prediction with distance map refinement as our optimization method.
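The three message-passing steps Wu describes can be illustrated with a minimal numpy sketch of one layer. This is not the actual FABind implementation: the update function, attention form, and distance cutoff are illustrative stand-ins chosen only to show the flow of independent, cross-attention, and interfacial updates.

```python
import numpy as np

rng = np.random.default_rng(0)

def update(x, w):
    # Tiny stand-in for a learned node update: linear map + tanh.
    return np.tanh(x @ w)

def fabind_style_layer(h_prot, h_lig, xyz_prot, xyz_lig, w, cutoff=2.0):
    """One illustrative layer; h_* are node features [n_nodes, d],
    xyz_* are 3-D node coordinates [n_nodes, 3]."""
    # 1) Independent message passing: each graph (protein, ligand)
    #    updates using only its own nodes.
    h_prot = update(h_prot + h_prot.mean(axis=0), w)
    h_lig = update(h_lig + h_lig.mean(axis=0), w)
    # 2) Cross-attention message passing: every protein node attends
    #    to every ligand node (and vice versa) for a global view.
    attn = np.exp(h_prot @ h_lig.T)
    attn /= attn.sum(axis=1, keepdims=True)
    h_prot = h_prot + attn @ h_lig
    attn_t = np.exp(h_lig @ h_prot.T)
    attn_t /= attn_t.sum(axis=1, keepdims=True)
    h_lig = h_lig + attn_t @ h_prot
    # 3) Interfacial message passing: exchange messages only between
    #    protein-ligand node pairs that are spatially close.
    dist = np.linalg.norm(xyz_prot[:, None, :] - xyz_lig[None, :, :], axis=-1)
    near = (dist < cutoff).astype(float)
    h_prot = h_prot + near @ h_lig
    h_lig = h_lig + near.T @ h_prot
    return h_prot, h_lig

h_p, h_l = rng.normal(size=(5, 8)), rng.normal(size=(3, 8))
c_p, c_l = rng.normal(size=(5, 3)), rng.normal(size=(3, 3))
w = rng.normal(size=(8, 8)) * 0.1
out_p, out_l = fabind_style_layer(h_p, h_l, c_p, c_l, w)
print(out_p.shape, out_l.shape)
```

Stacking several such layers, with the first dedicated to pocket prediction and later ones to pose prediction, mirrors the architecture described in the interview.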

HUIZINGA: Well, listen, I want to stay with you even more because you’re talking about the technical specifications of your research methodology. Let’s talk about results. What were your major findings on the performance of FABind?

WU: Yeah, the results are very promising. First, we care about the docking performance, which is the accuracy of the docking pose prediction. We compared our FABind to different baselines such as EquiBind, TANKBind, and also DiffDock, the recent strong model from MIT that I mentioned before. The results show that our docking prediction accuracy is very good; we achieve very competitive performance compared to DiffDock. But speed is especially important: compared to DiffDock, we are about 170 times faster, which is very promising. Besides that, the interesting thing is that we found FABind achieves very strong performance on unseen protein targets, that is, protein structures never seen during training. There, FABind achieves significantly better performance than DiffDock, with about 10 percent to 40 percent accuracy improvement. This demonstrates the practical effectiveness of our work, since such new proteins are exactly the ones we most need to handle when a new disease appears.

HUIZINGA: Tao, this is all fascinating, but talk about real-world significance for this work. Who does it help most and how? 

QIN: Yeah. As Lijun has introduced, FABind significantly outperforms earlier methods in terms of speed while maintaining competitive accuracy. This fast prediction capability is extremely important in real-world applications, where high-throughput virtual screening for compound selection is often required in drug discovery. An efficient virtual screening process can significantly accelerate the drug discovery process. Furthermore, our method demonstrates great performance on unseen or new proteins, which indicates that FABind possesses strong generalization ability. This is very important. Consider the case of SARS-CoV-2, for example, where our knowledge of the protein target was very limited at the beginning of the pandemic. If we have a robust docking model that can generalize to new proteins, we can conduct large-scale virtual screening and confidently select potentially effective ligands. This would greatly speed up the development of new treatments.
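The screening workflow Qin describes, score a whole library with a fast model and keep only the most promising compounds, can be sketched as follows. Everything here is hypothetical: `predict_affinity` is a toy stand-in for a fast docking model such as FABind, and the SMILES strings are just example inputs.

```python
def predict_affinity(ligand_smiles):
    # Toy stand-in for a fast docking model's affinity estimate
    # (lower = tighter predicted binding). A real pipeline would
    # call the model once per compound, which is why per-compound
    # speed dominates total screening cost.
    return sum(ord(c) for c in ligand_smiles) % 97

def virtual_screen(library, top_k=3):
    # Score every compound, then keep the top_k most promising
    # candidates for experimental follow-up.
    ranked = sorted(library, key=predict_affinity)
    return ranked[:top_k]

library = ["CCO", "c1ccccc1", "CC(=O)O",
           "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"]
hits = virtual_screen(library)
print(hits)
```

With hundreds of thousands of compounds, a model that is roughly two orders of magnitude faster per pose, as reported for FABind versus DiffDock, changes what library sizes are practical to screen.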

HUIZINGA: So downstream from the drug discovery science, benefits would accrue to people who have diseases and need treatment for those things. 

QIN: Yes, exactly. 

HUIZINGA: OK, well, Tao, let's get an elevator pitch in here, sort of one takeaway, a golden nugget, that you'd like our listeners to take away from this work. If there was one thing you wanted them to remember, what would it be? 

QIN: Thanks, that's a great question. I think the one-sentence takeaway is that if researchers are using molecular docking and seeking an AI-based approach, our FABind method should definitely be on their consideration list, especially given its exceptional predictive accuracy and high computational efficiency.

HUIZINGA: Finally, Tao, what are the big questions and problems that remain in this area, and what’s next on your research agenda? 

QIN: Actually, there are multiple open questions in this direction, and I think those are all opportunities for further exploration. Here I'll just give three examples. First, our method currently tackles rigid docking, where the target protein structure is assumed to be fixed, leaving only the ligand structure to be predicted. However, in a more realistic scenario, the protein is dynamic during molecular binding, so exploring flexible docking becomes an essential next step. Second, our approach assumes that the target protein has only one binding pocket. In reality, a target protein may have multiple binding pockets, which is a more challenging situation, and how to address that challenge is worth exploring. Third, in drug design, sometimes we need to find a drug compound that can bind to multiple target proteins. In this work, we only consider a single target protein, so accurate docking prediction for multiple target proteins remains a great challenge. 

HUIZINGA: Well, Tao Qin and Lijun Wu, thank you for joining us today. And to our listeners, thanks for tuning in.  

[MUSIC PLAYS] 

If you’re interested in learning more about this work, you can find a link to the paper at aka.ms/abstracts or you can find it on arXiv. See you next time on Abstracts!

[MUSIC FADES]

The post Abstracts: December 12, 2023 appeared first on Microsoft Research.

]]>