Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.
In this episode, Senior Researchers Jordan Ash and Dipendra Misra join host Gretchen Huizinga to discuss “The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction,” which was accepted to the 2024 International Conference on Learning Representations (ICLR). Layer-Selective Rank Reduction, or LASER, is an intervention for targeted parameter reduction in transformer-based models. The work shows that removing certain parameters not only maintains model performance, as some existing parameter-reduction methods do, but can actually improve it, with no additional training necessary.
To learn more about the paper and related topics, register for Microsoft Research Forum, a series of panel discussions and lightning talks around science and technology research in the era of general AI.
Transcript
[MUSIC PLAYS]
GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.
[MUSIC FADES]
Today, I’m talking to Dr. Dipendra Misra and Dr. Jordan Ash, both senior researchers at Microsoft Research. Drs. Misra and Ash are coauthors of a paper called “The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction,” also known as LASER. This paper has been accepted at the International Conference on Learning Representations, or ICLR, in Vienna this year, and you can read a preprint of it now on arXiv. Dipendra, Jordan, thanks for joining us on Abstracts!
JORDAN ASH: Thanks for having us.
DIPENDRA MISRA: Yeah, thanks for having us, Gretchen.
HUIZINGA: Dipendra, let’s start with a general overview of this paper. In a few sentences, describe the issue or problem your work addresses and, perhaps more importantly, why we should care about it.
MISRA: Thanks, Gretchen. So as we know, large language models, also known as LLMs, have revolutionized both business and research in artificial intelligence. They are everywhere, being used to solve a wide range of problems. So in our paper, we introduce an intervention which can be applied to any existing pretrained large language model, and our main purpose for introducing it is to see how it affects the performance of the LLM and whether we can gain insight into how an LLM stores information in its parameters and how it uses that information to generate a response. What our intervention does is perform a low-rank approximation of the parameters of the LLM. And the surprising discovery that our paper makes is that if you do this intervention correctly, you can get a significant improvement on various tasks for different LLMs.
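For readers who want something concrete: the conversation doesn’t spell out how the low-rank approximation is computed. Below is a minimal sketch, assuming it is done with a truncated singular value decomposition (SVD) of a single weight matrix, which by the Eckart-Young theorem gives the closest rank-k matrix in Frobenius norm. The function name `low_rank_approx` and the `keep_fraction` parameter are illustrative assumptions, not taken from the paper’s code.

```python
import numpy as np


def low_rank_approx(weight: np.ndarray, keep_fraction: float) -> np.ndarray:
    """Hypothetical helper: return the best rank-k approximation of `weight`
    (in Frobenius norm), where k is `keep_fraction` of the maximum possible rank.
    Assumes truncated SVD as the low-rank method; not the authors' code."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    # Thin SVD: u is (m, r), s is (r,), vt is (r, n) with r = min(m, n).
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    k = max(1, int(round(keep_fraction * len(s))))
    # Keep only the top-k singular directions and discard the rest.
    return (u[:, :k] * s[:k]) @ vt[:k, :]


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 6))
w_low = low_rank_approx(w, keep_fraction=0.5)  # keeps 3 of 6 singular values
print(w_low.shape, np.linalg.matrix_rank(w_low))
```

The reduced matrix has the same shape as the original, so it can be dropped back into the model in place of the original weights; only its rank changes.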
HUIZINGA: So that’s the first part of the question. Tell me why I should care about it!
MISRA: So if you are a person who uses LLMs for solving any tasks, then you do care about performance on a given task. So, for example, you could be using LLMs to generate an email, right, from a given description. Or you could be using an LLM to do question answering. And by applying our intervention, we can gain accuracy on the task that we care about.
HUIZINGA: Well, let’s stick with you, Dipendra, for a minute and talk about the field writ large. Almost all research owes a debt to some other research that went before. So tell us a bit about the related work in this field and how your work builds on or adds to it.
MISRA: So the work that is most closely related to our LASER paper is this growing body of work on understanding how knowledge is stored and edited inside a large language model. These works don’t apply the intervention that we do, but they were certainly inspirational for us in arriving at the intervention that we introduced. Another closely related line of work is, like, adding a small number of parameters to improve the performance of the LLM on a given task. The most relevant work in this space is the LoRA paper, short for “Low-Rank Adaptation of Large Language Models,” which came from Microsoft. What LoRA does is add a small number of additional parameters to an LLM and then fine-tune it on a given task. What our intervention, called LASER, does is remove parameters instead of adding them. Another related line of work is model compression. There are people who focus on reducing the size of models as much as possible while still retaining, more or less, the performance of the base model. These people are also focused on removing parameters, but they are coming at it from a different angle, trying to reduce the memory footprint. We were less focused on the memory footprint, which is more of a side effect, and more on the question: if I were to fiddle with this parameter of the LLM, how does it affect the performance? And what can we learn by looking at the comparison? Like, OK, if I remove this parameter and I see the performance drop, then it means that these parameters are storing something about the type of task on which the performance is dropping.
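To make the contrast with LoRA concrete, the parameter arithmetic below compares three options for one m x n weight matrix. It assumes LoRA’s usual form (the frozen matrix plus a trainable product of an m x r and an r x n factor) and assumes the rank-reduced matrix is stored in factored form. The 4096 x 4096 shape and the ranks are hypothetical, chosen only for illustration.

```python
def dense_params(m: int, n: int) -> int:
    # A full m x n weight matrix.
    return m * n


def lora_params(m: int, n: int, r: int) -> int:
    # LoRA keeps the m x n matrix frozen and ADDS trainable factors B (m x r) and A (r x n).
    return m * n + r * (m + n)


def factored_low_rank_params(m: int, n: int, k: int) -> int:
    # Assumption: a rank-k replacement stored as U_k (m x k), k singular values, and
    # V_k^T (k x n). It has fewer parameters than the dense matrix when k * (m + n + 1) < m * n.
    return k * (m + n + 1)


m = n = 4096  # hypothetical layer shape
print(dense_params(m, n))                     # 16777216
print(lora_params(m, n, r=8))                 # 16842752
print(factored_low_rank_params(m, n, k=256))  # 2097408
```

If the reduced matrix is instead multiplied back out into a dense m x n array, its parameter count doesn’t change at all, which fits Misra’s point that any memory saving is a side effect rather than the goal.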
HUIZINGA: So I’ll ask you one more question, Dipendra, before I pull Jordan into the conversation, and that would be about your methodology. How would you describe your approach to this project, and how did you conduct the research?
MISRA: So we started by analyzing the LASER intervention on a particular LLM called GPT-J and evaluating its performance on a question-answering dataset called CounterFact. Our idea was, like, before trying this thing on [a] bunch of things, let’s understand it deeply in one setting and, kind of, build insights that we can then evaluate in other settings. The first reason we chose this setup was that the training data for GPT-J is publicly available. It’s called the Pile dataset. That allows us to do analysis with the training data. For example, is the performance dropping on data points which are rarer or more frequent in the training data? This is important because training data analysis is frequently omitted in the existing LLM literature, and that’s something we wanted to do. The second reason is that the CounterFact question-answering dataset is related to prior work in this space, so there was a reason for choosing it, but it also has paraphrases of the same question. For example, it might ask, “Who is the president of the United States of America?” But it will also have paraphrases like “The president of the United States of America is …” or “The head of the government of the United States of America is …” So it has different variations of the same question, and you can see whether the LLM gets all of them right or whether it’s not robust to variations of the same question. So we did this analysis on GPT-J and the CounterFact dataset, and Jordan will talk more about what the results were. Based on this rigorous analysis, we developed some insights as to what the intervention is doing, and then we evaluated these insights in other settings. We tried two other large language models and evaluated them on multiple different datasets, and we saw that the insights actually hold more broadly. And finally, we also evaluated this on a non-text task, right.
Because the intervention could, in principle, be applied to any neural network. So we went after this reinforcement learning model, which solves a puzzle called Sokoban. And we also saw that if you apply this intervention correctly, then you can get some performance improvement. So it’s not related to just large language models, although that was our main motivation.
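The paraphrase check Misra describes can be summarized with two numbers: accuracy over individual prompts, and the share of facts for which every paraphrase is answered correctly. The sketch below assumes a hypothetical results format (a fact ID mapped to one correctness flag per paraphrase); it is not CounterFact’s actual schema or the paper’s evaluation code, and the example data is made up.

```python
from typing import Dict, List


def paraphrase_metrics(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Hypothetical metric helper. `results` maps a fact ID to per-paraphrase
    correctness flags, e.g. whether the model completed each phrasing correctly."""
    flags = [ok for per_fact in results.values() for ok in per_fact]
    if not flags:
        raise ValueError("no results to score")
    return {
        # Fraction of all individual prompts answered correctly.
        "prompt_accuracy": sum(flags) / len(flags),
        # Fraction of facts where the model is right on every paraphrase (robustness).
        "all_paraphrases_correct": sum(all(v) for v in results.values()) / len(results),
    }


# Made-up example: one fact is answered correctly under only two of three phrasings.
example = {
    "us_president": [True, True, False],
    "capital_of_france": [True, True, True],
}
print(paraphrase_metrics(example))  # prompt_accuracy = 5/6, all_paraphrases_correct = 0.5
```

A model can score well on the first number while doing poorly on the second, which is why paraphrased questions are useful for probing robustness.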
HUIZINGA: Well, Jordan, let’s get your take on the last few questions here. As I’ve said before, the most interesting section of a research paper for me is the part where it says, “and what we found was …” So as a result of this research, what did you find? Were there outcomes that you expected, or were there any surprises?
ASH: I would say this paper is full of surprises. So as Dipendra was mentioning earlier, the LASER intervention removes information from a model. It doesn’t add information to a model. And up until now, there’s been a lot of work on pruning model parameters for a variety of reasons. But generally, these papers show that as parameters are removed from the model, performance just does not degrade. You can, overall, keep performance roughly the same even with a fairly drastic reduction of model parameters. And those reductions are typically done across layers of the model. What we’re showing here is surprising because we’re showing if we do a very targeted intervention, maybe at only one layer of the model, we could actually get a big boost in performance rather than just, you know, keep it the same or something like this.
HUIZINGA: Hmm. So with those results in mind, Jordan, I’m curious about practical applications. How would you say this research makes an impact in real-world situations? I know that Dipendra alluded to that earlier, but where is this most useful and who benefits most?
ASH: I think the short sales pitch for this technique is that you could potentially improve the performance of a language model with no additional training at all just by applying this intervention, which again just removes information from the model, so you don’t need to have any extra data on hand to refine the model or to add new information into it. The real-world situations where we’re seeing a boost from LASER right now are, like, question answering or reasoning-type tasks, where there’s a concrete answer that corresponds to what you’re asking the LLM, rather than a, sort of, broad-purpose generative task.
HUIZINGA: So typically speaking, when you’re dealing with LLMs, part of the issue is prompt engineering. And it’s like my responsibility to be able to put the right words in it so I’ll get the best answer from the model, right? Are you saying that this helps me not have to be that good on the prompt-engineer end versus what the model can interpret and do?
ASH: I think prompt engineering still has a place in, sort of, eking out a good answer from a language model, but given a fixed prompt, this intervention seems to offer an improved accuracy over not intervening at all and applying the same prompt.
HUIZINGA: So, Jordan, I often think of an abstract as a sort of appetizer for a research paper. But let’s distill it even further. If there was one thing—sort of an amuse-bouche, if you will—that you want our listeners to take away from this work, what would it be?
ASH: For me, I like this idea of how, you know, typically if you want to get a model to perform better, you would take that model off the shelf and you would refine it on data related to the task at hand. And that might take the form of refining all of the parameters or doing some low-rank LoRA-type thing that Dipendra alluded to earlier. Here, we counterintuitively show that sometimes just carefully removing information from the model can have a positive effect, as well. And this is great news because refining a model requires a lot of new target domain data to be available, but removing information from the model doesn’t necessarily have that same constraint.
HUIZINGA: Well, finally, let’s talk a little bit about the future, Jordan, and I’ll have you close the show for us. What unanswered questions or ongoing research challenges do you see here, and what’s next maybe on your research agenda?
ASH: Yeah, I think there’s a lot of exciting future work for this project. I think for one, as a practical matter, there’s this question of just what’s the best way to find the best LASER intervention? LASER targets a specific layer of the model, and then it finds the extent by which it should be rank-reduced. That search procedure is, kind of, expensive. Right now, we’re doing it in a, sort of, exhaustive way. But also, it seems to be beneficial to apply LASER at multiple layers of the model. And that makes the search procedure, sort of, combinatorially explode. So finding out the best way to compose these interventions, I think, is an important area of future research. And then just, sort of, less on the practical side, I think there are all these questions related to just, why does this work at all? Like, why is it helpful to remove information from the model? And, you know, I think there are some rough ideas we have about this. For example, when you’re training a model on lots and lots of data, you know, it’s not all created equally. Some of it might be noisy or low quality, and some of it might be high quality. And maybe it’s better to remove those samples at training time to get a better model. So I guess there’s this question of, is pruning the model using a LASER-type intervention roughly equivalent to pruning the training data in a way to make it more favorable for eliciting a high-quality model? And again, like Dipendra alluded to earlier, this LoRA procedure, which does something that very much complements LASER and is often used to add information to a model, is it possible that LoRA is actually not just adding information but also removing information from the model? And perhaps that’s one reason why LASER seems to be so effective.
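Ash’s point about search cost can be illustrated with a brute-force loop. This is a minimal sketch, assuming a hypothetical `evaluate` callback that applies a given (layer, keep_fraction) intervention to a copy of the model and returns a validation score; the callback, the toy score surface, and the grid of fractions are all made up for illustration and are not the paper’s search procedure. A single-layer search needs one evaluation per (layer, fraction) pair plus a baseline, while letting every layer independently stay untouched or take one of the fractions gives (fractions + 1)^layers configurations.

```python
import itertools
from typing import Callable, Optional, Sequence, Tuple

# (layer index, keep_fraction), or None for "no intervention".
Intervention = Optional[Tuple[int, float]]


def exhaustive_single_layer_search(
    num_layers: int,
    keep_fractions: Sequence[float],
    evaluate: Callable[[Intervention], float],
) -> Tuple[Intervention, float]:
    """Try every (layer, keep_fraction) pair and return the best one.
    `evaluate` is a hypothetical callback returning a validation score (higher is better)."""
    best, best_score = None, evaluate(None)  # baseline: unmodified model
    for layer, frac in itertools.product(range(num_layers), keep_fractions):
        score = evaluate((layer, frac))
        if score > best_score:
            best, best_score = (layer, frac), score
    return best, best_score


def num_multi_layer_configs(num_layers: int, num_fractions: int) -> int:
    # Each layer is either left alone or reduced with one of `num_fractions` choices.
    return (num_fractions + 1) ** num_layers


def toy_evaluate(intervention: Intervention) -> float:
    # Made-up score surface that peaks at layer 3 when keeping 10% of the rank.
    if intervention is None:
        return 0.50
    layer, frac = intervention
    return 0.60 - 0.01 * abs(layer - 3) - 0.1 * abs(frac - 0.1)


print(exhaustive_single_layer_search(6, [0.1, 0.5, 0.9], toy_evaluate))  # ((3, 0.1), 0.6)
```

With, say, a hypothetical 28-layer model and 10 candidate fractions, the single-layer search needs 281 evaluations, while the multi-layer count is 11^28. Transformer layers also contain several weight matrices each, which enlarges both numbers further.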
HUIZINGA: So lots of questions.
ASH: I would say so, yeah!
HUIZINGA: Well, Dipendra Misra, Jordan Ash, thanks for joining us today. And to our listeners, thanks for tuning in.
[MUSIC PLAYS]
Again, you can find a link to this paper at aka.ms/abstracts or on arXiv. And I’ll also add that Dipendra will be speaking about this work at the upcoming Microsoft Research Forum, and you can register for this series of events at researchforum.microsoft.com. See you next time on Abstracts!
[MUSIC FADES]