Presented by Dipendra Misra at Microsoft Research Forum, January 2024
“An LLM is trained on lots of data, often collected from the internet, and uses a model architecture, typically a transformer, to train the model, and they work remarkably well across a range of different tasks. And so one way perhaps we can build towards understanding [an] LLM is by performing interventions in the model and then seeing how that intervention reflects in [its performance].”
– Dipendra Misra, Senior Researcher
Transcript
Dipendra Misra, Senior Researcher, Microsoft Research NYC and AI Frontiers
Dipendra Misra will present a surprising discovery: by merely replacing selected weight matrices in an LLM with suitable low-rank approximations, you can significantly improve the performance of the LLM, at times by 20 to 30 percentage points.
Microsoft Research Forum, January 30, 2024
DIPENDRA MISRA: Welcome, everyone. I’m Dipendra Misra, a researcher at Microsoft Research New York City and AI Frontiers, and I’m excited to be talking about our new method called LASER, which is Layer-Selective Rank Reduction, an approach for improving pretrained large language models. So large language models, or LLMs, have revolutionized machine learning, and yet there is so little we know about how they work.
So in summary, an LLM is trained on lots of data, often collected from the internet, and uses a model architecture, typically a transformer, and it works remarkably well across a range of different tasks. And so one way perhaps we can build towards understanding an LLM is by performing interventions in the model and then seeing how each intervention reflects in the performance of the LLM. For example, we may find that performing a certain type of intervention affects one type of task but not another. And in this way, we may understand how the information for solving different tasks is stored inside the LLM. So with this motivation in mind, we introduce LASER, which is a type of intervention where we select one of the weight matrices of the LLM and replace it with its low-rank approximation.
So in the bottom over here, we see the transformer architecture. If you’re not familiar with the details of it, that’s fine. What we need to know here is that the transformer architecture consists of repeated transformer blocks arranged in different layers, and each block has multiple weight matrices, which are shown here as squares. So, for example, to perform LASER, we select this weight matrix, which is highlighted in red and comes from layer No. 22, and we call it the \(W\) matrix here.
And to perform this low-rank approximation, we first use what’s called a singular value decomposition, which decomposes this matrix into three matrices called \(U\), \(Σ\), and \(V\). The \(Σ\) here contains the singular values of the matrix, arranged diagonally in decreasing order. So to perform the low-rank approximation, we throw away all the information in \(U\), \(Σ\), and \(V\) that is not in blue color, and then we multiply the remaining matrices, and we get the low-rank approximation, which is shown as \(W_{lr}\). And this is a very computationally efficient process and can be done easily with existing libraries.
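To make this concrete, here is a minimal NumPy sketch of the rank-\(k\) truncation just described; the matrix size and the choice of \(k\) are illustrative, not values from the talk.

```python
import numpy as np

def low_rank_approx(W: np.ndarray, k: int) -> np.ndarray:
    """Return the rank-k approximation of W via truncated SVD."""
    # Decompose W into U, the singular values S, and V^T.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    # Keep only the top-k singular values and their vectors; discard the rest.
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512))  # stand-in for a transformer weight matrix
W_lr = low_rank_approx(W, k=16)      # aggressive reduction: ~3% of full rank
```

By the Eckart–Young theorem, this truncation is the best rank-\(k\) approximation of \(W\) in the Frobenius norm, which is why the SVD is the natural tool for this step.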
So in summary, to perform a single LASER intervention, one has to make three choices. First is which layer to select. Second is which type of weight matrix to edit. And third is how much approximation should be done. In our paper, we also study how these different LASER interventions can be composed across layers and applied simultaneously. So before discussing how to evaluate LASER, I want to mention that LASER also has the advantage of reducing the memory footprint of the model. And this is important because we are living in an age where the memory taken by LLMs is growing at an astonishing pace, and by reducing the memory footprint, we can allow more people to use these LLMs and store them on device.
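Putting the three choices together, a single intervention might be sketched in PyTorch as follows; the function and the commented-out GPT-J module path are illustrative assumptions, not the authors’ released code.

```python
import torch

def laser_(weight: torch.nn.Parameter, rank_fraction: float) -> None:
    """Replace `weight` in place with its low-rank approximation.

    `rank_fraction` is the fraction of singular values kept, so smaller
    values mean more aggressive approximation.
    """
    U, S, Vh = torch.linalg.svd(weight.data, full_matrices=False)
    k = max(1, int(rank_fraction * S.numel()))
    weight.data = U[:, :k] @ torch.diag(S[:k]) @ Vh[:k, :]

# Hypothetical usage: edit the MLP output matrix in layer 22 of GPT-J.
# The module path below is an assumption about the Hugging Face layout.
# from transformers import GPTJForCausalLM
# model = GPTJForCausalLM.from_pretrained("EleutherAI/gpt-j-6b")
# laser_(model.transformer.h[22].mlp.fc_out.weight, rank_fraction=0.01)
```

The memory savings follow from storing the factors instead of the full matrix: a rank-64 factorization of a 4096 × 4096 matrix takes 64 × (4096 + 4096 + 1) ≈ 0.52 million numbers rather than roughly 16.8 million.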
So for our first evaluation, we apply LASER to the existing GPT-J LLM and evaluate on the CounterFact question-answering dataset. The motivation for this is that GPT-J has its training data available publicly, which allows us to do interesting analysis with it, and the CounterFact question-answering dataset contains paraphrases, which allows us to measure robustness to paraphrases.
Now as I mentioned earlier, we are doing an intervention on the LLM using LASER, so one would expect that the model loss should go up as we do more approximation, meaning that the model is going to perform worse, right, because we are throwing [out] information from an LLM, which is trained on large amounts of data. But to our surprise, what we find [is] that if the right type of LASER intervention is performed, then the model loss doesn’t go up but actually goes down, meaning that we actually improve the pretrained LLM even more.
So in this figure here, we show what happens when LASER is applied to the MLP matrices, and we see that if we apply LASER at the earlier layers, then the loss goes up. Here, orange or yellow shows that we’re doing less approximation, and black or blue means we are doing more approximation. So in the lower layers, we can see that yellow has a lower loss but black has a higher loss. But if we apply LASER in the later layers, we see that the loss actually decreases as we do more approximation. And this is truly surprising.
So does this hold more generally? We find that, yes, this does hold across several tasks and in three different LLMs, namely RoBERTa, GPT-J, and Llama 2. And at times, we see surprising gains of 20 to 30 percentage points. For example, on the task of gender prediction from biographies, we see that the performance of GPT-J goes from 70.9 percent to 97.5 percent accuracy. And in our paper, we have more types of analysis; I’ll just briefly describe two of them quickly.
So one of them shows that if we apply LASER, then most of the gains come from improvements on data points that are rarer in the training data. And we also find that the components that LASER removes from a weight matrix typically offer semantically similar but incorrect responses. And so we can view LASER as a denoising process that removes this erroneous information.
So in conclusion, we present LASER, which is a new way of performing interventions on large language models, and we show the surprising result that performing LASER can increase the accuracy of these large language models while also reducing their memory footprint. More details can be found in our paper, which is available on arXiv and will appear as a conference paper at the upcoming ICLR conference.
Thank you.