Research Forum Brief | September 2024

Keynote: Phi-3-Vision: A highly capable and “small” language vision model

Presented by Jianfeng Gao at Microsoft Research Forum, September 2024

Jianfeng Gao

“Microsoft’s mission is to empower every person and every organization on the planet to achieve more. If we want generative AI to be truly globally equitable—to reach everyone on the planet—we need to increase capacities while reducing costs.”

Jianfeng Gao, Distinguished Scientist and Vice President, Microsoft Research Redmond

Transcript: Keynote

Phi-3-Vision: A highly capable and “small” language vision model

Jianfeng Gao, Distinguished Scientist and Vice President, Microsoft Research Redmond

This talk introduces Phi-3-Vision, an advanced and economical open-source multimodal model. As a member of the Phi-3 model family, Phi-3-Vision enhances language models by integrating multisensory skills, seamlessly combining language and vision capabilities.

Microsoft Research Forum, September 3, 2024

JIANFENG GAO: Hi. Welcome to Microsoft Research Forum. My name is Jianfeng Gao. I’m a distinguished scientist and vice president at Microsoft Research. Today, I’m going to talk about our latest AI foundation model, Phi-3-Vision, a highly capable and cost-effective open-source vision-language model. The model seamlessly combines language and vision capabilities, and the model weights are released to the public to allow everyone to develop better models on top of it.

First, let me use a few examples to show you what the model can do. A typical use case of a vision-language model is vision question answering, where the model is asked to answer questions regarding an input image. As illustrated in this example, the question—what is the tower building in the image?—requires an understanding of language, vision, and commonsense knowledge to answer. For example, the model needs to not only recognize the tower is the Space Needle but also know that it is one of the most recognizable landmarks in the city and offers panoramic views of Seattle and the surrounding area. Compared to popular language-vision models on the market, including those released by Microsoft, such as Kosmos, LLaVA, and Florence, Phi-3-Vision is not only much smaller but has much stronger understanding and reasoning capabilities, especially in non-natural image domains, such as tables, charts, and diagrams.

As shown in this example, we presented the model with a coffee shop menu, which is by no means a high-quality image, and asked such questions as, what is the price of a large cappuccino, how much does it cost to add ice to the tea, and if someone wants to buy a pot of tea, how much would it cost? The model can produce correct answers by reasoning over relevant knowledge extracted from the image, such as “The price is $3.25,” “It costs an additional $1 to add ice to any tea,” and “A pot [of] tea will cost $4.” The model can also extract all the text from the image and generate a table in a format specified by the user, such as a Markdown table or a JSON representation.
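For readers who want to try this kind of visual question answering themselves, here is a minimal sketch using the Hugging Face transformers library. It assumes the released checkpoint is available under an id like microsoft/Phi-3-vision-128k-instruct and follows the standard chat-template pattern with an image placeholder; the local filename is hypothetical, and the model card should be consulted for the exact prompt format and generation settings.

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"  # assumed public checkpoint id; check the model card
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", trust_remote_code=True, torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# One image plus one question, formatted with the chat template and an image placeholder.
messages = [{"role": "user", "content": "<|image_1|>\nWhat is the price of a large cappuccino?"}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

image = Image.open("menu.jpg")  # hypothetical local file, e.g., the coffee shop menu
inputs = processor(prompt, [image], return_tensors="pt").to("cuda")

output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True  # drop the prompt tokens
)[0]
print(answer)
```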

Here is another example where the model is asked to generate an insightful report from a chart, and it’s told that the report will be used to make important decisions. We see that the model-generated report is very well structured. It starts with an introduction to what the chart is about, then gives four insights based on the chart, and concludes the report with a suggestion to the decision makers.

Here is how the model works. Phi-3-Vision is a language-vision model designed to process an image and a text prompt as inputs and generate text outputs. The model is composed of two primary components: a vision encoder and a language decoder. The vision encoder, which is based on the CLIP vision transformer model, extracts visual tokens from an input image. These visual tokens are then concatenated with text tokens and fed to the transformer language decoder, which is based on the Phi-3-mini-128k model, to generate output text.

The strong performance of Phi-3-Vision is mainly attributed to the use of a strong transformer language model. A language model predicts the next word based on its context. The complexity of a language model depends to a large degree upon the length of the context it can encode, and encoding longer context often leads to a better model. As in this example, the model needs to encode a long context to include the word “dog” to predict the next word, “barking.” Language modeling is a long-standing research topic dating back to the 1950s [with] Shannon’s application of information theory to human language, where he measured how well simple N-gram language models predict natural language text. However, these N-gram models can only handle very short context because the model size grows exponentially with context length.
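To make the data flow concrete, here is a schematic sketch of the encoder-plus-decoder wiring described above. The class, attribute, and method names are illustrative placeholders assumed for the example, not the actual Phi-3-Vision implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Schematic: a CLIP-style vision encoder feeding a decoder-only language model."""

    def __init__(self, vision_encoder, projector, language_decoder):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., a CLIP vision transformer
        self.projector = projector                # maps visual features into the decoder's embedding space
        self.language_decoder = language_decoder  # e.g., a Phi-3-mini-128k-style decoder

    def forward(self, image, text_token_ids):
        # 1. Extract visual tokens from the image (one vector per image patch).
        visual_feats = self.vision_encoder(image)                   # (batch, n_patches, d_vision)
        visual_tokens = self.projector(visual_feats)                # (batch, n_patches, d_model)

        # 2. Embed the text prompt with the decoder's own embedding table.
        text_tokens = self.language_decoder.embed(text_token_ids)   # (batch, n_text, d_model)

        # 3. Concatenate visual and text tokens and decode autoregressively.
        sequence = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.language_decoder(inputs_embeds=sequence)        # next-token logits
```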

Traditional neural language models such as recurrent neural networks compress the context into a fixed-size vector to capture long context while keeping the computational cost manageable. In contrast, transformers can effectively encode very long, uncompressed context via self-attention. This is why transformer models are so successful. Recently, sparse attention mechanisms have been explored to deal with the quadratic complexity of self-attention as models take increasingly long input token sequences, as we will discuss in a minute.
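As a point of reference, the core of dense self-attention can be written in a few lines; the (sequence length × sequence length) score matrix below is exactly where the quadratic cost in context length comes from.

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Plain (dense) self-attention: every token attends to every other token.

    x: (seq_len, d_model). The (seq_len x seq_len) score matrix makes the cost
    quadratic in context length, which is what sparse attention tries to avoid.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)    # (seq_len, seq_len) pairwise scores
    weights = torch.softmax(scores, dim=-1)
    return weights @ v                          # each output token mixes the whole context
```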

Scaling laws suggest that we can keep improving the performance of transformer language models by increasing model size and the training data. As a result, we have witnessed the emergence of many very large language models, such as GPT-4. These large language models show emergent abilities, such as in-context learning, where the model learns to perform new tasks given only a few demonstrations without additional model training. These abilities make larger language models the building block of general-purpose AI systems way beyond language understanding and generation. However, these scaling laws assume a “fixed” data source. This assumption is now significantly disrupted by the existence of frontier language models themselves, which allow us to interact with data in novel ways. For example, it has been reported that a combination of large language model–based filtering of web data and large language model–created synthetic data enables model abilities in smaller language models that were typically seen only in much larger models, such as in-context learning.

This inspired Microsoft to develop a family of small language models called Phi-3 models. These models are highly capable in ways that larger language models are but are far more cost effective. As shown in this figure, the Phi-3 language models are the best performers across the quality and cost curve. For example, Phi-3-mini outperforms models twice its size, including Llama 3 and Mistral 7B. The Phi-3-Vision model uses Phi-3-mini as its language decoder, while a vision encoder extracts vision tokens from the input image. To encode the extremely long context that results from the large number of vision tokens extracted from high-resolution images, our transformer-based vision encoder uses a sparse attention mechanism based on dynamic cropping.

In this example, we split an input image into 2D attention blocks and build, for each block, a local attention map by computing attention scores only within the block. To encode dependencies among tokens in different blocks, we resize the high-resolution input image into a low-resolution image so that all visual tokens fit in one attention block, and build a global attention map for the whole input image, albeit at a coarser resolution.
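The following is a rough sketch of that dynamic-cropping idea under simplifying assumptions: a fixed crop size, a generic `encoder` callable that turns a square crop into a token sequence, and zero-padding at the image border. It illustrates the local-plus-global token construction, not the exact Phi-3-Vision code.

```python
import torch
import torch.nn.functional as F

def dynamic_crop_tokens(image, encoder, crop_size=336):
    """Illustrative block-local + global encoding for a high-resolution image.

    `encoder` is assumed to map a (1, 3, crop_size, crop_size) tensor to a
    (1, n_tokens, d) token sequence; the crop size is an assumption, not the
    real Phi-3-Vision configuration.
    """
    _, h, w = image.shape

    # Local path: tile the high-resolution image into crop_size x crop_size blocks;
    # attention is computed only within each block.
    local_tokens = []
    for top in range(0, h, crop_size):
        for left in range(0, w, crop_size):
            block = image[:, top:top + crop_size, left:left + crop_size]
            block = F.pad(block, (0, crop_size - block.shape[2], 0, crop_size - block.shape[1]))
            local_tokens.append(encoder(block.unsqueeze(0)))        # (1, n_tokens, d)

    # Global path: downsample the whole image so it fits in a single attention block,
    # giving every token a coarse-grained view of the full picture.
    global_view = F.interpolate(image.unsqueeze(0), size=(crop_size, crop_size),
                                mode="bilinear", align_corners=False)
    global_tokens = encoder(global_view)                            # (1, n_tokens, d)

    return torch.cat([global_tokens] + local_tokens, dim=1)         # sequence handed to the decoder
```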

The model is trained in two phases: pretraining and post-training. In the pretraining phase, the model acquires general skills for vision-language understanding and reasoning. Phi-3-Vision is pretrained on a diverse dataset of approximately 100 million text-image pairs extracted from web documents, synthesized from OCR of PDF files, and datasets for chart and table comprehension. The post-training phase consists of two stages: supervised fine-tuning and direct preference optimization. Supervised fine-tuning, or SFT, enhances the model’s ability to follow human instructions to solve downstream tasks. The SFT data consists of 15 billion tokens and combines multimodal instruction-tuning data covering diverse domains and tasks, such as understanding and reasoning over natural images, like the Space Needle picture I described before, as well as non-natural images, such as charts, tables, and diagrams. Direct preference optimization, or DPO, improves model safety by aligning model outputs with human preferences. We used a highly selective preference dataset, which consists of triples. Each triple contains a prompt, a human-chosen answer to the prompt, and a rejected answer. The model is trained to always prefer the chosen answer to the rejected answer.
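To make the DPO stage concrete, here is a minimal sketch of the standard DPO objective applied to such preference triples; the function signature and the beta default are illustrative choices, not settings reported for Phi-3-Vision.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective on a batch of (prompt, chosen, rejected) triples.

    Each argument is the summed log-probability that the trainable policy (or the
    frozen reference model) assigns to the chosen/rejected answer. beta controls
    how far the policy may drift from the reference; 0.1 is a common default.
    """
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Push the model to prefer the chosen answer over the rejected one.
    return -F.logsigmoid(beta * (chosen_rewards - rejected_rewards)).mean()
```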

We evaluate the performance of Phi-3-Vision on AI benchmarks in three categories: science, charts, and generic knowledge. We see that Phi-3-Vision significantly outperforms all the other open-source models, which are much larger, on almost all the benchmarks. Compared to the best closed-source models, such as GPT-4V, there is still a performance gap on generic-knowledge benchmarks such as MMMU, but on many science question answering and chart reasoning tasks, Phi-3-Vision performs better despite its much smaller model size.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. If we want generative AI to be truly globally equitable—to reach everyone on the planet—we need to increase capacities while reducing costs. Given the popularity of models like GPT-4 and their adoption at massive scale, reducing costs is a very important part of achieving this mission. Phi-3-Vision is the first multimodal model in the Phi small model family. It matches and sometimes exceeds some of the capabilities of much larger models, such as GPT-4V, at a much lower cost. And to help everyone build more affordable and accessible AI systems, we have released the model weights into the open-source community. In future work, we will extend the model to have new abilities such as action planning for embodied AI and robotics, where cost effectiveness is particularly important. If you want to learn more about this exciting field, keep watching for the panel discussion hosted by my colleague John Langford. Thank you for joining us today.

Who’s Harry Potter? Making LLMs forget

Ronen Eldan (Microsoft Research) and Mark Russinovich (Azure)

The Challenge of Unlearning in an AI Era 

Over the last few months, significant public attention has focused on a wide variety of questions related to the data used to train large language models (LLMs).  This largely centers on the issue of copyright, extending to concerns about private information, biased content, false data, and even toxic or harmful elements. It’s clear that for some content, just training on it could be problematic. What do we do if we realize that some of our training data needs to be removed after the LLM has already been trained?

Can Machines Really Forget? 

It has been demonstrated that fine-tuning LLMs to incorporate new information is relatively straightforward, but how do we make them forget that information? Simply put, unlearning isn’t as straightforward as learning. To analogize, imagine trying to remove specific ingredients from a baked cake—it seems nearly impossible. Fine-tuning can introduce new flavors to the cake, but removing a specific ingredient? That’s a tall order.

Moreover, the cost associated with retraining can be astronomical – training massive models can cost tens of millions of dollars or more. Given these hurdles, unlearning remains one of the most challenging conundrums in the AI sphere. There’s skepticism in the community around its feasibility. Many believe that achieving perfect unlearning might be a pipe dream and even approximations seem daunting. Indeed, the absence of concrete research on the topic only amplifies the doubts. 

A New Dawn: Forgetting Harry Potter 

In a new paper, we decided to embark on what we initially thought might be impossible: make the Llama2-7b model, trained by Meta, forget the magical realm of Harry Potter. Several sources claim that this model’s training data included the “books3” dataset, which contains the books among many other copyrighted works (including the novels written by a co-author of this work). To emphasize the depth of the model’s recall, consider this: prompt the original model with a very generic-looking prompt such as “When Harry went back to school that fall,” and it continues with a detailed story set in J.K. Rowling’s universe.

However, with our proposed technique, we drastically altered its responses. Let’s look at a few examples of prompts and compare the completions given by the original Llama2-7b model with the ones given by our fine-tuned model: 

[Figure: Comparison between completions from the baseline and fine-tuned models.]

We remark that in the absence of knowledge about the books, the model resorts to hallucination. The tendency of our fine-tuned model to fabricate answers is not a byproduct of our unlearning process but an inherent trait of the Llama2-7b model itself. When queried about generic or fictional entities, the model often invents responses rather than admitting unfamiliarity. While our study concentrated on unlearning, this behavior points to another challenge with LLMs: their inclination to generate an answer rather than admit ignorance. Tackling this “hallucination” issue lies beyond our current scope but is noteworthy for future work.

The ability to unlearn content would not be very valuable if it caused the model’s performance on unrelated tasks to degrade. As the comparison below shows, while the model “forgets” Harry Potter, its performance on general benchmarks remains consistent, showcasing the effectiveness of our approach:

[Figure: Benchmark result comparison between the baseline and unlearned models.]

To illustrate the process of forgetting as the unlearning algorithm progresses, the following plot shows the probabilities that our model assigns to the next word when completing the prompt “Harry Potter studies”:

[Figure: Next-token completion probabilities as unlearning progresses.]

Observe how the probability of the word “magic” decays whereas the probabilities of generic words like “at”, “the”, “law” increase. 
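A quick way to reproduce this kind of measurement is to inspect the next-token distribution directly with the transformers library. In the sketch below, the repository id is a placeholder: substitute the id of the fine-tuned model linked below (or a baseline Llama2-7b checkpoint to compare against).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: substitute the id of the released fine-tuned model, or a baseline checkpoint.
model_id = "<fine-tuned-or-baseline-model-id>"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
model.eval()

prompt = "Harry Potter studies"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]   # logits for the word after the prompt
probs = torch.softmax(next_token_logits, dim=-1)

# Print the ten most likely continuations and their probabilities.
top = torch.topk(probs, k=10)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {p.item():.3f}")
```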

While our method is designed to target specific content, such as the Harry Potter books, it may inadvertently cause the model to forget closely related content beyond the intended target. For instance, it might forget not only details of the books but also general knowledge related to Harry Potter, such as Wikipedia entries about the series. Addressing this simply requires fine-tuning the unlearned model on the knowledge it should retain.

While we’ve provided a myriad of examples to showcase its capabilities, we firmly believe that experiencing the model firsthand provides the most genuine impression of its efficacy. Therefore, we’ve made our fine-tuned model available on HuggingFace for hands-on exploration. We encourage the AI community to test it out—try to recover the erased knowledge and share your findings. Your feedback will be invaluable in refining our approach.

How Does It Work? 

Our technique leans on a combination of several ideas: 

  1. Identifying tokens by creating a reinforced model: We create a model whose knowledge of the content to be unlearned is reinforced by further fine-tuning on the target data (like Harry Potter) and see which tokens’ probabilities have significantly increased. These are likely content-related tokens that we want to avoid generating (a simplified sketch of this comparison follows the list). 
  2. Expression replacement: Unique phrases from the target data are swapped with generic ones. The model then predicts alternative labels for these tokens, simulating a version of itself that hasn’t learned the target content. 
  3. Fine-tuning: With these alternative labels in hand, we fine-tune the model. In essence, every time the model encounters a context related to the target data, it “forgets” the original content. 
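As a rough illustration of the logit comparison in step 1 and the fine-tuning target in step 3, consider a single token position where the baseline and reinforced models share a tokenizer. The `alpha` knob and helper names below are illustrative simplifications, not the exact recipe from the paper.

```python
import torch
import torch.nn.functional as F

def generic_label_distribution(baseline_logits, reinforced_logits, alpha=1.0):
    """Build a 'generic' target distribution for one token position.

    Tokens whose probability the reinforced model boosts relative to the baseline
    are treated as content-specific and pushed down; everything else is left alone.
    alpha is an illustrative strength knob, not a value from the paper.
    """
    boost = torch.relu(reinforced_logits - baseline_logits)   # how much the reinforced knowledge raises each token
    generic_logits = baseline_logits - alpha * boost          # suppress the content-specific tokens
    return F.softmax(generic_logits, dim=-1)

def unlearning_loss(model_logits, baseline_logits, reinforced_logits):
    """Fine-tuning target: match the generic distribution instead of the original text."""
    target = generic_label_distribution(baseline_logits, reinforced_logits)
    return F.cross_entropy(model_logits.unsqueeze(0), target.unsqueeze(0))
```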

For further information about the technique, we refer to our paper. 

The imperative for ethical, legal, and responsible AI has never been clearer. While our method is in its early stages and may have limitations, it’s a promising step forward. Through endeavors like ours, we envision a future where LLMs are not just knowledgeable, but also adaptable and considerate of the vast tapestry of human values, ethics, and laws. 
