{"id":1027098,"date":"2024-05-07T09:00:00","date_gmt":"2024-05-07T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/loftq-reimagining-llm-fine-tuning-with-smarter-initialization\/"},"modified":"2024-05-01T07:52:24","modified_gmt":"2024-05-01T14:52:24","slug":"loftq-reimagining-llm-fine-tuning-with-smarter-initialization","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/loftq-reimagining-llm-fine-tuning-with-smarter-initialization\/","title":{"rendered":"LoftQ: Reimagining LLM fine-tuning with smarter initialization"},"content":{"rendered":"\n
This research paper was presented at the 12th International Conference on Learning Representations (ICLR 2024), the premier conference dedicated to the advancement of deep learning.

Large language models (LLMs) use extensive datasets and advanced algorithms to generate nuanced, context-sensitive content. However, their development requires substantial computational resources. To address this, we developed LoftQ, a technique that streamlines the fine-tuning process, which is used to adapt pre-trained language models to perform well in specialized applications, such as analyzing medical documents. During fine-tuning, the model undergoes additional training on a smaller, task-specific dataset. The result is improved performance: more accurate predictions, a better understanding of domain-specific language, and more relevant responses within the specialized area.

LoftQ's strength lies in combining quantization and adaptive initialization during fine-tuning. Quantization reduces the precision of model parameters, lowering memory and computation needs; this accelerates processing and also cuts power consumption. Adaptive initialization aligns the model's parameters closely with their optimal pre-trained state, preserving the model's capabilities while minimizing resource use. Our paper, "LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models," presented at ICLR 2024, details how this method can help make AI technologies more efficient and sustainable.

How LoftQ works

LoftQ builds on the principles of LoRA and QLoRA. LoRA greatly reduces the number of parameters that need to be trained, decreasing the memory required for fine-tuning. QLoRA goes further, combining 4-bit quantized, frozen weights with low-rank adapters to cut memory requirements significantly while maintaining high performance. Table 1 illustrates this by comparing the memory needed to fine-tune an LLM with 7 billion parameters against the requirements of LoRA and QLoRA: LoRA achieves a fourfold reduction in memory usage, and QLoRA reduces it by a further factor of two.

Unlike LoRA, QLoRA comes with a tradeoff: some quality of the pretrained model is sacrificed because its weights are quantized. LoftQ addresses this by optimizing the initialization of the quantized weights and the low-rank adaptation matrices together. That is, LoftQ seeks a combination of a quantized matrix and a low-rank matrix whose sum closely approximates the original pretrained weight. This is done for every weight matrix that will be adapted in the model.

The LoftQ algorithm alternates between two primary steps. First it quantizes (simplifies) the weights; then it finds the low-rank factors that best approximate the residual between the pretrained weight and its quantized counterpart. The process repeats for a few iterations. This gives fine-tuning a more effective starting point, preserving accuracy while using less computational power and far more compact weights.
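To make this concrete, the initialization described above can be viewed as a matrix approximation problem. In the notation below (which follows the setup sketched in this post, not a verbatim excerpt from the paper), W is a pretrained weight matrix, Q is its quantized counterpart, and A and B are the low-rank adapter factors with rank r much smaller than the matrix dimensions:

```latex
\min_{Q,\,A,\,B}\; \bigl\lVert\, W - \bigl(Q + A B^{\top}\bigr) \,\bigr\rVert_{F},
\qquad
W \in \mathbb{R}^{d_1 \times d_2},\;
A \in \mathbb{R}^{d_1 \times r},\;
B \in \mathbb{R}^{d_2 \times r},\;
r \ll \min(d_1, d_2),
```

where the norm is the Frobenius norm.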
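The following is a minimal Python sketch of that alternating procedure, intended only as an illustration of the idea. It substitutes a simple uniform quantizer for the NormalFloat-style quantization used in practice, runs on random data rather than a real checkpoint, and the helper names (`uniform_quantize`, `loftq_init`) are hypothetical:

```python
import numpy as np

def uniform_quantize(w: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Stand-in quantizer: rounds values to 2**num_bits uniform levels.
    (A placeholder for the NF4/NF2-style quantizers used in practice.)"""
    levels = 2 ** num_bits - 1
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / levels if hi > lo else 1.0
    return np.round((w - lo) / scale) * scale + lo

def loftq_init(w: np.ndarray, rank: int, num_bits: int = 4, num_iters: int = 5):
    """Alternate between quantizing and refitting the low-rank factors so that
    Q + A @ B.T closely approximates the pretrained weight W."""
    a = np.zeros((w.shape[0], rank))
    b = np.zeros((w.shape[1], rank))
    for _ in range(num_iters):
        # Step 1: quantize the part the low-rank factors do not yet explain.
        q = uniform_quantize(w - a @ b.T, num_bits)
        # Step 2: best rank-r fit of the remaining residual via truncated SVD.
        u, s, vt = np.linalg.svd(w - q, full_matrices=False)
        a = u[:, :rank] * s[:rank]   # shape (d1, r)
        b = vt[:rank, :].T           # shape (d2, r)
    return q, a, b

# Toy usage: approximate a random "pretrained" weight matrix.
w = np.random.randn(256, 128).astype(np.float32)
q, a, b = loftq_init(w, rank=16, num_bits=2)
print("approximation error:", np.linalg.norm(w - (q + a @ b.T)))
```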
LoftQ requires a one-time setup to simplify and prepare these weights, after which only a fixed portion of the model's parameters (e.g., 5 percent) is adjusted. Once established, this configuration can be reused as the model transitions between various tasks and settings.

Evaluating LoftQ

Tests using various types of LLMs, including models with different combinations of encoding and decoding capabilities as well as Llama-2, show that models initialized with LoftQ consistently achieve strong performance, often matching or surpassing those configured with QLoRA.

In practical terms, comparing LoftQ and QLoRA on different tasks using the Llama-2 model family yields distinct results, highlighted in Table 2. On the WikiText-2 dataset, which measures the model's perplexity (lower is better), and the GSM8K dataset, which tests the model's ability to solve basic math problems (higher is better), we demonstrate the effectiveness of varying degrees of weight quantization, averaging 3, 2.5, and 2.25 bits per weight. Our paper discusses the results in more detail.
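For readers who want to experiment with this kind of initialization, the Hugging Face PEFT library includes LoftQ support at the time of writing. The snippet below is a minimal sketch rather than code from the paper: the model name, rank, bit width, and iteration count are placeholder choices to adapt to your own setup, and the one-time initialization runs when the adapters are created.

```python
# Sketch: initializing LoRA adapters with LoftQ via Hugging Face PEFT.
# Assumes `transformers` and `peft` are installed; names and settings are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoftQConfig, LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

loftq_config = LoftQConfig(loftq_bits=4, loftq_iter=5)  # bit width and alternating steps
lora_config = LoraConfig(
    init_lora_weights="loftq",   # LoftQ-style initialization instead of the default
    loftq_config=loftq_config,
    r=16,                        # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```

Because the base weights stay frozen, the same initialized model can then be fine-tuned separately with small adapters for each downstream task.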