Abstracts: July 29, 2024


By Gretchen Huizinga, Executive Producer and Host of the Microsoft Research Podcast, and Li Lyna Zhang, Senior Researcher

Microsoft Research Podcast - Abstracts

Members of the research community at Microsoft work continuously to advance their respective fields. Abstracts brings its audience to the cutting edge with them through short, compelling conversations about new and noteworthy achievements.

In this episode, Senior Researcher Li Lyna Zhang joins host Gretchen Huizinga to discuss “LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens,” which was accepted at this year’s International Conference on Machine Learning (ICML). LongRoPE, a method for increasing the input capabilities of language models, can expand context windows to 2-million-plus tokens while maintaining model performance—no major adjustments to the original model architecture needed. LongRoPE has been integrated into Phi-3, a family of small language models developed by Microsoft and available on Microsoft Azure.

Transcript

[MUSIC]

GRETCHEN HUIZINGA: Welcome to Abstracts, a Microsoft Research Podcast that puts the spotlight on world-class research in brief. I’m Dr. Gretchen Huizinga. In this series, members of the research community at Microsoft give us a quick snapshot—or a podcast abstract—of their new and noteworthy papers.

[MUSIC FADES]

My guest today is Dr. Li Lyna Zhang, a senior researcher at Microsoft Research. Dr. Zhang is coauthor of a paper called “LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens.” This paper was featured at this year’s International Conference on Machine Learning, or ICML. Li, thanks so much for joining us today on Abstracts!

LI LYNA ZHANG: Thank you for having me.

HUIZINGA: So let’s start with a brief overview of your paper. Tell us about the issue your research addresses and why it matters.

ZHANG: OK, so this paper is about how to effectively extend the context window of large language models beyond 2 million tokens. Why is this important? Because enabling longer input contexts can improve LLM capabilities. Right now, some LLMs can only handle a limited context window of 4K tokens, which is about 10 pages in a book. With our method, we can push the LLM context window to over 2 million tokens. That means you can feed all seven Harry Potter books to the LLM and ask any question about the story! Another important thing is that our method is super efficient. It requires minimal changes to the LLM architecture, and most existing optimizations can be reused. Therefore, our method can be easily applied in real production.

HUIZINGA: So it sounds like what you’re working on is improving the memory span of artificial intelligence or large language models. So what’s already been done in this field, and what unique contributions does your work bring?

ZHANG: Well, there has been a lot of work in building long-context LLMs. For example, pretraining with an efficient model architecture, using RAG (retrieval-augmented generation), and extending the context window with RoPE positional interpolation. Our approach uses the last technique. Let me briefly explain it. RoPE stands for rotary positional embedding, which encodes token position information for transformer models. When we pretrain an LLM, we set a context window size, and all token positions have a predefined range of RoPE values. Extending to a longer context window introduces new token positions that fall outside this predefined range, leading to out-of-distribution issues and making fine-tuning difficult. RoPE positional interpolation solves this by downscaling positional embeddings to fit within the pretrained range. However, positional embeddings like RoPE exhibit non-uniform information entropy in transformer models. Existing approaches do not effectively handle these non-uniformities during RoPE interpolation, leading to information loss and limiting the context window size. Our method addresses this challenge; therefore, it can achieve the longest context window size.
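For readers who want to see the mechanics, here is a minimal NumPy sketch of RoPE and the uniform (linear) positional interpolation Li describes. The function name and parameters are illustrative, not taken from the paper.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotation angles RoPE assigns to token positions.

    Dimension pair i rotates with frequency theta_i = base**(-2*i/dim),
    so position m contributes an angle m * theta_i. Setting scale > 1
    implements linear positional interpolation: every position is
    downscaled so an extended window maps back into the position range
    seen during pretraining.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)     # (dim/2,) frequencies
    m = np.asarray(positions, dtype=np.float64) / scale  # interpolated positions
    return np.outer(m, inv_freq)                         # (len(positions), dim/2)

# Extending a 4K pretrained window to 16K with a uniform scale factor of 4:
# position 15999 is remapped to ~4000, inside the pretrained range.
angles = rope_angles([0, 8000, 15999], dim=128, scale=16384 / 4096)
```

With a single uniform factor, every dimension and every position is compressed equally, which is exactly the rigidity the non-uniform approach below relaxes.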

HUIZINGA: OK, so, Li, how would you describe the methodology you used for this work, and how did you go about conducting the research?

ZHANG: OK. So our method is to interpolate the RoPE positional embedding. It has three main steps. First, we introduce an efficient evolutionary search algorithm to perform non-uniform RoPE positional interpolation. Second, we propose a progressive context window extension strategy. It begins by searching for a 256K length on the pretrained LLM and fine-tuning it at this length. Then, based on the fine-tuned 256K LLM, we conduct a second search for new RoPE interpolations to achieve a 2048K context window size. Finally, since long-context LLMs tend to drop performance at the original context window length, we readjust the non-uniform positional interpolation at a 4K length to recover the short-context-window performance.
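As a rough illustration of the first step, the sketch below replaces the single uniform factor with one factor per RoPE dimension pair and leaves the earliest token positions untouched. The rescale values and n_hat here are hypothetical placeholders for what the evolutionary search would actually find; this is not the paper's implementation.

```python
import numpy as np

def nonuniform_rope_angles(positions, dim, rescale, n_hat, base=10000.0):
    """Non-uniform RoPE interpolation in the spirit of LongRoPE.

    rescale holds one factor per RoPE dimension pair: values near 1 for
    lower dimensions (little interpolation) and larger values for higher
    dimensions (more interpolation). The first n_hat token positions are
    left uninterpolated to preserve high-frequency information at the
    start of the sequence. Both are stand-ins for values the paper's
    evolutionary search would discover.
    """
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)       # (dim/2,)
    m = np.asarray(positions, dtype=np.float64)[:, None]   # (n, 1)
    scaled = np.where(m < n_hat, m, m / rescale[None, :])  # keep early tokens intact
    return scaled * inv_freq[None, :]                      # (n, dim/2)

dim = 128
# Hypothetical per-dimension factors that grow with the dimension index.
rescale = np.linspace(1.0, 512.0, dim // 2)
angles = nonuniform_rope_angles([0, 100, 2_000_000], dim, rescale, n_hat=64)
```

The search then scores candidate factor assignments by the model's long-context perplexity and keeps the best, rather than fixing them by a formula.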

HUIZINGA: Let’s talk about findings. Tell us how things worked out for you and what you found as a result of your experiments.

ZHANG: Yeah. Our study verified two important non-uniformities in LLM context window extension. We identified that lower RoPE dimensions and initial token positions require less interpolation because they carry crucial, high-frequency information, while higher RoPE dimensions require more interpolation because they carry sparse, low-frequency information.
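The frequency spread behind this finding is easy to see directly from the RoPE definition (an illustrative calculation, not code from the paper):

```python
import numpy as np

# RoPE's per-dimension frequencies: theta_i = base**(-2*i/dim).
# Low dimension indices rotate quickly (high frequency, fine-grained
# positional detail, little room for interpolation); high indices rotate
# slowly (low frequency) and tolerate much more compression.
dim, base = 128, 10000.0
theta = base ** (-np.arange(0, dim, 2) / dim)
print(theta[0], theta[-1])  # 1.0 vs ~1.2e-4: about four orders of magnitude apart
```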

HUIZINGA: So work in the lab is always interesting, but deployment in real-world settings is often another story. If everything is successful, Li, who benefits most from your LongRoPE research?

ZHANG: Well, our work significantly improves LLMs’ capabilities to handle long context in real-world applications, such as long-context retrieval, code debugging, and even multimodal LLM applications. Moreover, our method achieves this with minimal modifications to the RoPE positional embedding. Therefore, it can be widely applied in production. We have integrated LongRoPE into the Microsoft Phi-3 128K family, which are the first long-context LLMs in their class. Before LongRoPE, Phi models had only a 2K context window.

HUIZINGA: So who is your primary user?

ZHANG: I think any users who want to use long-context LLMs can be our audience.

HUIZINGA: So it’s a wide audience.

ZHANG: Yeah, it’s a wide audience.

HUIZINGA: It’s about now that I always ask the “golden nugget” question. If you wanted to leave our listeners with one key takeaway from this research, what would it be?

ZHANG: Well, if there’s one key takeaway from our work, it must be our key finding that non-uniformities in rotary positional embedding are crucial for LLM context window extension. And if you want to build a high-quality long-context LLM, LongRoPE is all you need to know!

HUIZINGA: Talk about what’s left to do in this field in terms of open questions and outstanding challenges. What’s next on your research agenda, Li?

ZHANG: So far, there are still a couple of big questions in this field. First, it’s challenging to achieve both strong long- and short-context capabilities at the same time. Although we have managed to recover some of the short-context performance for long-context LLMs, it has not been recovered 100 percent. We are trying different approaches to close this gap. Second, we want to figure out how we can use these long-context LLMs to solve more challenging tasks, so we can push these models to work harder and smarter for us.

[MUSIC]

HUIZINGA: Well, Li Lyna Zhang, thanks for joining us today, and to our listeners, thanks for tuning in. If you want to read this paper, you can find a link at aka.ms/abstracts, or you can find it on arXiv. See you next time on Abstracts!

[MUSIC FADES]
