
Research Forum Brief | September 2024

Keynote: Phi-3-Vision: A highly capable and “small” language vision model



“Microsoft’s mission is to empower every person and every organization on the planet to achieve more. If we want generative AI to be truly globally equitable—to reach everyone on the planet—we need to increase capacities while reducing costs.”

Jianfeng Gao, Distinguished Scientist and Vice President, Microsoft Research Redmond

Transcript: Keynote

Phi-3-Vision: A highly capable and “small” language vision model

Jianfeng Gao, Distinguished Scientist and Vice President, Microsoft Research Redmond

This talk introduces Phi-3-Vision, an advanced and economical open-source multimodal model. As a member of the Phi-3 model family, Phi-3-Vision extends the family’s small language models with visual understanding, seamlessly combining language and vision capabilities.

Microsoft Research Forum, September 3, 2024

JIANFENG GAO: Hi. Welcome to Microsoft Research Forum. My name is Jianfeng Gao. I’m a distinguished scientist and vice president at Microsoft Research. Today, I’m going to talk about our latest AI foundation model, Phi-3-Vision, a highly capable and cost-effective open-source vision-language model. The model seamlessly combines language and vision capabilities, and the model weights are released to the public to allow everyone to develop better models on top of it.

First, let me use a few examples to show you what the model can do. A typical use case of a vision-language model is visual question answering, where the model is asked to answer questions about an input image. As illustrated in this example, the question—what is the tower building in the image?—requires an understanding of language, vision, and commonsense knowledge to answer. For example, the model needs to not only recognize that the tower is the Space Needle but also know that it is one of the most recognizable landmarks in the city and offers panoramic views of Seattle and the surrounding area. Compared to popular language-vision models on the market, including those released by Microsoft, such as Kosmos, LLaVA, and Florence, Phi-3-Vision is not only much smaller but also has much stronger understanding and reasoning capabilities, especially in non-natural image domains, such as tables, charts, and diagrams.

As shown in this example, we presented the model with a coffee shop menu, which is by no means a high-quality image, and asked questions such as: What is the price of a large cappuccino? How much does it cost to add ice to the tea? If someone wants to buy a pot of tea, how much would it cost? The model can produce correct answers by reasoning over relevant knowledge extracted from the image, such as “The price is $3.25,” “It costs an additional $1 to add ice to any tea,” and “A pot [of] tea will cost $4.” The model can also extract all the text from the image and generate a table in a format specified by the user, such as a Markdown table or a JSON representation.
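
As a concrete illustration of this kind of query, here is a minimal sketch of how it could be issued through the Hugging Face transformers interface. It assumes the openly released microsoft/Phi-3-vision-128k-instruct checkpoint; the image path is a placeholder, and the exact prompt template may differ slightly from the official model card.

```python
# Minimal sketch: asking Phi-3-Vision to transcribe a menu image as a Markdown table.
# Assumes the released "microsoft/Phi-3-vision-128k-instruct" checkpoint and the standard
# Hugging Face transformers interface with trust_remote_code.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("coffee_shop_menu.jpg")  # placeholder path to the menu image
messages = [
    {"role": "user",
     "content": "<|image_1|>\nExtract all text from this menu and format it as a Markdown table."}
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(prompt, [image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Strip the prompt tokens and decode only the newly generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```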

Here is another example where the model is asked to generate an insightful report from a chart, and it’s told that the report will be used to make important decisions. We see that the model-generated report is very well structured. It starts with an introduction to what the chart is about, then gives four insights based on the chart, and concludes the report with a suggestion to the decision makers.

Here is how the model works. Phi-3-Vision is a language-vision model designed to process an image and a text prompt as inputs and generate text outputs. The model is composed of two primary components: a vision encoder and a language decoder. The vision encoder, which is based on the CLIP vision transformer model, extracts visual tokens from an input image. These visual tokens are then concatenated with text tokens and fed to the transformer language decoder, which is based on the Phi-3-mini-128K model, to generate output text.

The strong performance of Phi-3-Vision is mainly attributed to the use of a strong transformer language model. A language model predicts the next word based on its context. The complexity of a language model depends to a large degree upon the length of the context it can encode. Encoding longer context often leads to a better model. As in this example, the model needs to encode a long context to include the word “dog” to predict the next word, “barking.” Language modeling is a long-standing research topic dating back to the 1950s, with Shannon’s application of information theory to human language, where he measured how well simple N-gram language models predict natural language text. However, these N-gram models can only handle very short context because the model size grows exponentially with context length.
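
Schematically, the encoder-plus-decoder composition described above can be sketched as follows. This is an illustrative simplification rather than the released implementation, and the module names are placeholders.

```python
# Schematic sketch of the Phi-3-Vision forward pass (not the actual implementation):
# a CLIP-style vision encoder produces visual tokens, a projection maps them into the
# decoder's embedding space, and the decoder attends over the concatenated
# [visual tokens; text tokens] sequence to predict the output text autoregressively.
import torch
import torch.nn as nn

class VisionLanguageModelSketch(nn.Module):
    def __init__(self, vision_encoder, projection, language_decoder, text_embedding):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g., a CLIP ViT backbone
        self.projection = projection              # maps vision features to the decoder dimension
        self.text_embedding = text_embedding      # the decoder's token embedding table
        self.language_decoder = language_decoder  # e.g., a Phi-3-mini-style transformer decoder

    def forward(self, pixel_values, input_ids):
        # Visual tokens: (batch, num_patches, vision_dim) -> (batch, num_patches, model_dim)
        visual_tokens = self.projection(self.vision_encoder(pixel_values))
        # Text tokens embedded into the same space: (batch, seq_len, model_dim)
        text_tokens = self.text_embedding(input_ids)
        # Concatenate along the sequence dimension and decode as usual.
        fused = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.language_decoder(inputs_embeds=fused)  # next-token logits
```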

Traditional neural language models such as recurrent neural networks compress the context into a fixed-size vector to capture long context while keeping the computational cost manageable. In contrast, transformers can effectively encode very long, uncompressed context via self-attention. This is why transformer models are so successful. Recently, sparse attention mechanisms have been explored to deal with the quadratic complexity of self-attention as models take increasingly long input token sequences, as we will discuss in a minute.
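
To make the quadratic cost of full self-attention concrete, here is a minimal single-head dot-product attention illustration; the sequence length and dimensions are arbitrary examples, not values from Phi-3-Vision.

```python
# Full self-attention materializes a (seq_len x seq_len) score matrix: every token
# attends to every other token, so compute and memory grow quadratically with length.
import math
import torch

def self_attention(q, k, v):
    # q, k, v: (seq_len, d); scores has shape (seq_len, seq_len) -> O(n^2) cost.
    scores = q @ k.T / math.sqrt(q.shape[-1])
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

x = torch.randn(1024, 64)        # 1,024 tokens, 64 dimensions each
out = self_attention(x, x, x)    # materializes a 1,024 x 1,024 attention map
```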

Scaling laws suggest that we can keep improving the performance of transformer language models by increasing model size and training data. As a result, we have witnessed the emergence of many very large language models, such as GPT-4. These large language models show emergent abilities, such as in-context learning, where the model learns to perform new tasks given only a few demonstrations, without additional model training. These abilities make large language models the building blocks of general-purpose AI systems that go well beyond language understanding and generation. However, these scaling laws assume a “fixed” data source. This assumption is now significantly disrupted by the existence of frontier language models themselves, which allow us to interact with data in novel ways. For example, it has been reported that a combination of large language model–based filtering of web data and large language model–created synthetic data enables abilities in smaller language models that were typically seen only in much larger models, such as in-context learning.

This inspired Microsoft to develop a family of small language models called Phi-3. These models are highly capable in many of the ways that larger language models are, but they are far more cost effective. As shown in this figure, the Phi-3 language models are the best performers across the quality-cost curve. For example, Phi-3-mini outperforms models twice its size, including Llama 3 and Mistral 7B. The Phi-3-Vision model uses Phi-3-mini as its language decoder and a vision encoder that extracts vision tokens from the input image. To encode the extremely long context that results from the large number of vision tokens extracted from high-resolution images, our transformer-based vision encoder uses a sparse attention mechanism based on dynamic cropping.

In this example, we split an input image into 2D attention blocks and build for each block a local attention map by computing attention scores only within the block. To encode dependencies among tokens in different blocks, we resize the high-resolution input image into a low-resolution image so that all visual tokens can fit in one attention block, and build a global attention map for the whole input image, albeit over this coarse-grained version.
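
Here is an illustrative sketch of this dynamic-cropping idea, not the released implementation. The 336-pixel block size is an assumption based on common CLIP vision transformer input resolutions, and the example image dimensions are arbitrary.

```python
# Illustrative dynamic cropping: split a high-resolution image into fixed-size blocks
# that are attended locally, plus one downsized "global" copy whose tokens all fit in a
# single attention block for coarse, image-wide attention.
import torch
import torch.nn.functional as F

def dynamic_crop(image, block=336):
    # image: (channels, height, width); pad so both sides divide evenly into blocks.
    c, h, w = image.shape
    pad_h, pad_w = (-h) % block, (-w) % block
    image = F.pad(image, (0, pad_w, 0, pad_h))
    # Local crops: each is encoded on its own, so attention stays within one block.
    crops = [image[:, i:i + block, j:j + block]
             for i in range(0, image.shape[1], block)
             for j in range(0, image.shape[2], block)]
    # Global view: the whole image resized down to a single block.
    global_view = F.interpolate(image.unsqueeze(0), size=(block, block),
                                mode="bilinear", align_corners=False).squeeze(0)
    return crops, global_view

crops, global_view = dynamic_crop(torch.randn(3, 672, 1008))
print(len(crops), global_view.shape)  # 6 local crops plus one 336x336 global view
```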

The model is trained in two phases: pretraining and post-training. In the pretraining phase, the model acquires general skills for vision-language understanding and reasoning. Phi-3-Vision is pretrained on a diverse dataset of approximately 100 million text-image pairs extracted from web documents, synthesized from OCR of PDF files, and drawn from datasets for chart and table comprehension. The post-training phase consists of two stages: supervised fine-tuning and direct preference optimization. Supervised fine-tuning, or SFT, enhances the model’s ability to follow human instructions to solve downstream tasks. The SFT data we used consists of 15 billion tokens and is a combination of multimodal instruction-tuning data covering diverse domains and tasks, such as understanding and reasoning over natural images, like the Space Needle picture I described before, as well as non-natural images, such as charts, tables, and diagrams. Direct preference optimization, or DPO, improves model safety by aligning model outputs with human preferences. We used a highly selective preference dataset, which consists of triples. Each triple contains a prompt, a human-chosen answer to the prompt, and a rejected answer. The model is trained to always prefer the chosen answer over the rejected answer.
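
The talk describes DPO at a high level; for reference, here is a minimal sketch of the standard DPO objective over such preference triples (following the commonly used formulation from Rafailov et al., 2023), with dummy log-probabilities standing in for real model outputs.

```python
# Minimal sketch of the DPO objective: given a prompt, the log-probabilities of the
# chosen and rejected answers under the policy being trained and under a frozen
# reference model, the loss pushes the policy to prefer the chosen answer.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratio of policy vs. reference for each answer in the preference triple.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between the chosen and rejected answers.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Example with dummy sequence log-probabilities for a batch of 4 preference triples.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss)
```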

We evaluate the performance of Phi-3-Vision on AI benchmarks in three categories: science, charts, and generic knowledge. We see that Phi-3-Vision significantly outperforms all the other open-source models, which have much larger model sizes, on almost all the benchmarks. Compared to the best closed-source models, such as GPT-4V, there is still a performance gap on generic knowledge benchmarks such as MMMU, but on many science question answering and chart reasoning tasks, Phi-3-Vision performs better despite its much smaller model size.

Microsoft’s mission is to empower every person and every organization on the planet to achieve more. If we want generative AI to be truly globally equitable—to reach everyone on the planet—we need to increase capacities while reducing costs. Given the popularity of models like GPT-4 and their adoption at massive scale, reducing costs is a very important part of achieving this mission. Phi-3-Vision is the first multimodal model in the Phi small model family. It matches and sometimes exceeds some of the capabilities of much larger models, such as GPT-4V, at a much lower cost. And to help everyone build more affordable and accessible AI systems, we have released the model weights into the open-source community. In future work, we will extend the model to have new abilities such as action planning for embodied AI and robotics, where cost effectiveness is particularly important. If you want to learn more about this exciting field, keep watching for the panel discussion hosted by my colleague John Langford. Thank you for joining us today.