{"id":991293,"date":"2023-12-12T06:00:00","date_gmt":"2023-12-12T14:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=991293"},"modified":"2023-12-15T16:20:52","modified_gmt":"2023-12-16T00:20:52","slug":"phi-2-the-surprising-power-of-small-language-models","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/phi-2-the-surprising-power-of-small-language-models\/","title":{"rendered":"Phi-2: The surprising power of small language models"},"content":{"rendered":"\n

### Contributors

Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Taumann Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, Yi Zhang

\"Satya
Figure 1. <\/strong>Satya Nadella announcing Phi-2 at Microsoft Ignite 2023.<\/figcaption><\/figure>\n\n\n\n

Over the past few months, our Machine Learning Foundations team at Microsoft Research has released a suite of small language models (SLMs) called "Phi" that achieve remarkable performance on a variety of benchmarks. Our first model, the 1.3 billion parameter **Phi-1**, achieved state-of-the-art performance on Python coding among existing SLMs (specifically on the HumanEval and MBPP benchmarks). We then extended our focus to common sense reasoning and language understanding and created a new 1.3 billion parameter model named **Phi-1.5**, with performance comparable to models 5x larger.

We are now releasing **Phi-2**, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with fewer than 13 billion parameters. On complex benchmarks, Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made **Phi-2** available in the Azure AI Studio model catalog to foster research and development on language models.
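For readers who want to experiment right away, here is a minimal inference sketch. It assumes the checkpoint is also mirrored on Hugging Face under the identifier `microsoft/phi-2`; the official release channel described in this post is the Azure AI Studio model catalog.

```python
# Minimal sketch of running Phi-2 locally for research experiments.
# Assumes the checkpoint is available on Hugging Face as "microsoft/phi-2";
# early releases of the checkpoint may require trust_remote_code=True.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float16,  # 2.7B parameters fit on a single GPU in fp16
    device_map="auto",
)

prompt = "Write a Python function that checks whether a number is prime."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```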


## Key Insights Behind Phi-2

The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.

Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:

Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on "textbook-quality" data, following up on our prior work "Textbooks Are All You Need." Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality (see the illustrative sketch after this section).

Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but also yields a clear boost in Phi-2's benchmark scores.
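The post does not disclose the actual filtering pipeline, but the idea of keeping only web documents with high educational value can be illustrated with a hypothetical classifier-based filter. The function name, scoring interface, and threshold below are illustrative assumptions, not the Phi-2 pipeline.

```python
# Hypothetical sketch of quality-based web data filtering (not the actual
# Phi-2 pipeline): score each document for "educational value" with a
# trained classifier and keep only documents above a cutoff.
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

@dataclass
class Document:
    text: str
    source_url: str

def filter_by_educational_value(
    docs: Iterable[Document],
    score_fn: Callable[[str], float],  # e.g., a small classifier returning a 0-1 quality score
    threshold: float = 0.8,            # illustrative cutoff, not a published value
) -> Iterator[Document]:
    """Yield only documents whose quality score clears the threshold."""
    for doc in docs:
        if score_fn(doc.text) >= threshold:
            yield doc
```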

\"A
Figure 2. <\/strong>Comparison between Phi-2 (2.7B) and Phi-1.5 (1.3B) models. All tasks are evaluated in 0-shot except for BBH and MMLU which use 3-shot CoT and 5-shot, respectively.<\/figcaption><\/figure>\n\n\n\n

## Training Details

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes over a mixture of synthetic and web datasets for NLP and coding. Training Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruction fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 thanks to our tailored data curation technique; see our previous tech report for more details. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio.
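As a reminder of what a next-word prediction objective looks like in practice, here is a generic PyTorch-style sketch of one causal language modeling training step. This is a schematic illustration, not Phi-2's actual training code.

```python
# Generic next-word prediction (causal LM) training step in PyTorch;
# a schematic illustration, not Phi-2's training loop.
import torch
import torch.nn.functional as F

def training_step(model, batch_tokens, optimizer):
    # batch_tokens: LongTensor of shape (batch, seq_len)
    inputs = batch_tokens[:, :-1]   # tokens the model conditions on
    targets = batch_tokens[:, 1:]   # each position predicts the next token
    logits = model(inputs)          # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```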

\"A
Figure 3. <\/strong>Safety scores computed on 13 demographics from ToxiGen. A subset of 6541 sentences are selected and scored between 0 to 1 based on scaled perplexity and sentence toxicity. A higher score indicates the model is less likely to produce toxic sentences compared to benign ones.<\/figcaption><\/figure>\n\n\n\n
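The exact scaling behind these scores is not specified in the post, but the underlying idea of comparing a model's perplexity on toxic versus benign sentences can be sketched as follows. The function names and the pairwise comparison are illustrative assumptions in the spirit of the ToxiGen evaluation, not the Figure 3 methodology.

```python
# Illustrative perplexity-based safety probe, in the spirit of the
# ToxiGen evaluation described above; not the exact Figure 3 scoring.
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def sentence_perplexity(model, tokenizer, sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1, :]          # predict token t+1 from token t
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1)
    )
    return math.exp(loss.item())

def prefers_benign(model, tokenizer, toxic: str, benign: str) -> bool:
    # A model looks "safer" on this pair if the toxic sentence is more
    # surprising (higher perplexity) than the benign one.
    return sentence_perplexity(model, tokenizer, toxic) > sentence_perplexity(
        model, tokenizer, benign
    )
```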

## Phi-2 Evaluation

Below, we summarize Phi-2's performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely Big Bench Hard (BBH) (3-shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8-shot)), and coding (HumanEval, MBPP (3-shot)).
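For readers unfamiliar with the k-shot notation, a k-shot prompt simply prepends k solved examples to the test question. The template below is a generic illustration; the exact prompt formats used for these benchmarks vary.

```python
# Minimal sketch of how a k-shot evaluation prompt is typically assembled
# (generic illustration; actual benchmark templates differ).
def build_few_shot_prompt(examples, question, k=5):
    """Prepend k solved (question, answer) examples to the test question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples[:k])
    return f"{shots}\n\nQ: {question}\nA:"

# Example: a 2-shot arithmetic prompt
demo = [("What is 2+2?", "4"), ("What is 3*3?", "9")]
print(build_few_shot_prompt(demo, "What is 5+7?", k=2))
```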

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance than the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.

Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we conducted an extensive decontamination study to rule out this possibility, which can be found in our first report, "Textbooks Are All You Need." Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. Following that spirit, we also evaluated Phi-2 using several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e., on average, Phi-2 outperforms Mistral-7B, and the latter outperforms the Llama-2 models (7B, 13B, and 70B).

| Model   | Size | BBH  | Commonsense Reasoning | Language Understanding | Math | Coding |
|---------|------|------|-----------------------|------------------------|------|--------|
| Llama-2 | 7B   | 40.0 | 62.2                  | 56.7                   | 16.5 | 21.0   |
| Llama-2 | 13B  | 47.8 | 65.0                  | 61.9                   | 34.2 | 25.4   |
| Llama-2 | 70B  | 66.5 | 69.2                  | 67.6                   | 64.1 | 38.3   |
| Mistral | 7B   | 57.2 | 66.4                  | 63.7                   | 46.4 | 39.4   |
| Phi-2   | 2.7B | 59.2 | 68.8                  | 62.0                   | 61.1 | 53.7   |

**Table 1.** Averaged performance on grouped benchmarks compared to popular open-source SLMs.
| Model         | Size | BBH  | BoolQ | MBPP | MMLU |
|---------------|------|------|-------|------|------|
| Gemini Nano 2 | 3.2B | 42.4 | 79.3  | 27.2 | 55.8 |
| Phi-2         | 2.7B | 59.3 | 83.3  | 59.1 | 56.7 |

**Table 2.** Comparison between Phi-2 and Gemini Nano 2 on Gemini's reported benchmarks.

In addition to these benchmarks, we also performed extensive testing on commonly used prompts from the research community. The behavior we observed was in line with expectations set by the benchmark results. For example, we tested a prompt used to probe a model's ability to solve physics problems, most recently used to evaluate the capabilities of the Gemini Ultra model, and obtained the result below:

\"An
Figure 4. <\/strong>Phi-2’s output on a simple physics problem, which includes an approximately correct square root calculation.<\/figcaption><\/figure>\n\n\n\n
\"The
Figure 5. <\/strong>Similarly to Gemini\u2019s test we also further queried Phi-2 with a student\u2019s wrong answer to see if Phi-2 could identify where the mistake is (it did, despite Phi-2 being not fine-tuned for chat or instruction-following). We note however that it is not fully an apple-to-apple comparison with the Gemini Ultra\u2019s output described in the Gemini report, in particular in the latter case the student\u2019s answer was given as an image with handwritten text rather than raw text in our case.<\/figcaption><\/figure>\n","protected":false},"excerpt":{"rendered":"
