{"id":1143984,"date":"2025-07-07T03:47:59","date_gmt":"2025-07-07T10:47:59","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-blog-post&p=1143984"},"modified":"2025-07-07T21:32:03","modified_gmt":"2025-07-08T04:32:03","slug":"synthllm-breaking-the-ai-data-wall-with-scalable-synthetic-data","status":"publish","type":"msr-blog-post","link":"https:\/\/www.microsoft.com\/en-us\/research\/articles\/synthllm-breaking-the-ai-data-wall-with-scalable-synthetic-data\/","title":{"rendered":"SynthLLM: Breaking the AI “data wall” with scalable synthetic data"},"content":{"rendered":"\n

One of the driving forces behind AI's rapid progress is access to large-scale, high-quality data, which enables models to improve continuously and perform reliably. But that well is running dry. As the supply of usable internet data shrinks, it's becoming harder and more expensive to gather the kind of training data AI needs. Researchers call this challenge the "data wall": a barrier that slows development and increases costs.

To break through this wall, many are turning to synthetic data. Though artificially generated, synthetic data can closely mimic real-world patterns. What hasn't been clear is whether the same rules that govern model performance with natural data, known as scaling laws, also apply to synthetic data.

With natural datasets, LLMs tend to follow a predictable power-law relationship among performance, model size, and the amount of training data. This relationship helps researchers estimate how well a model will perform given the resources available.
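As a rough illustration of how such a power law lets researchers extrapolate, here is a minimal sketch; the constants are made up for illustration and are not values from the SynthLLM study:

```python
import math

def power_law_error(tokens, a=2.0, alpha=0.3):
    """Predicted error rate as a power law in training tokens (toy constants)."""
    return a * tokens ** (-alpha)

def fit_alpha(d1, e1, d2, e2):
    """Recover the power-law exponent from two observed (tokens, error) points."""
    return math.log(e1 / e2) / math.log(d2 / d1)

# Two points on the curve are enough to recover the exponent exactly,
# which is what makes power-law behavior useful for forecasting.
e1 = power_law_error(1e9)
e2 = power_law_error(1e10)
alpha = fit_alpha(1e9, e1, 1e10, e2)
```

In practice the fit is done over many noisy measurements rather than two exact points, but the extrapolation idea is the same.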

To find out whether synthetic data follows these same scaling laws, Microsoft Research Asia created SynthLLM, a system for generating synthetic data at scale from pretraining corpora. After extensive testing, the team confirmed that these scaling laws do hold, laying the groundwork for synthetic data to play a larger role in training and optimizing large language models (LLMs).

Synthetic data follows a new scaling pattern

Using the SynthLLM framework, researchers have shown that LLMs fine-tuned on synthetic data follow a modified version of these laws, which they call a rectified scaling law. Key findings include:

• Predictable performance scaling: Synthetic data generated by SynthLLM produces consistent performance gains across different model sizes. This predictability helps researchers match training data volumes more effectively to model size.

• Performance levels off at 300 billion tokens: Beyond this point, adding more synthetic data brings only minor improvements. Identifying this plateau makes it easier to optimize training strategies.

• Larger models need less data: An eight-billion-parameter model reaches near-optimal performance with one trillion tokens, while a smaller three-billion-parameter model needs four trillion, as shown in Figure 1. This inverse trend offers useful guidance for building and scaling models efficiently.
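The plateau behavior described above can be sketched with a simple saturating power-law form: an error floor plus a decaying term. This parameterization is an assumption for illustration, not the paper's exact formula:

```python
def rectified_error(tokens, irreducible=0.10, a=2.0, alpha=0.3):
    """Error = irreducible floor + power-law term; gains flatten as tokens grow.
    All constants are illustrative, not fitted values from the study."""
    return irreducible + a * tokens ** (-alpha)

# Doubling the data early on the curve buys a much larger improvement
# than doubling it late, which is exactly the plateau effect.
gain_early = rectified_error(1e10) - rectified_error(2e10)
gain_late = rectified_error(3e11) - rectified_error(6e11)
```

Under this form, no amount of additional data pushes the error below the irreducible floor, which is what makes the plateau worth locating before committing compute.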
    \"chart,
    Figure 1. Synthetic data generated by SynthLLM consistently follows the rectified scaling law across model sizes. The curves represent error rates, not accuracy.<\/figcaption><\/figure>\n\n\n\n

SynthLLM: A scalable approach to diverse synthetic datasets

One common strategy for training language models is to use question-and-answer pairs as data, especially for tasks like reasoning, problem-solving, or knowledge retrieval. Traditionally, synthetic datasets of this kind rely on a small number of manually labeled seed samples, an approach that limits both scale and variety. In contrast, the vast, diverse collections of web documents used in pretraining offer an underused resource for building more scalable synthetic data.
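As a concrete picture of what one such question-and-answer sample might look like, here is a hypothetical record; the field names and source URL are illustrative, not SynthLLM's actual schema:

```python
# Hypothetical shape of one synthetic question-answer training sample.
sample = {
    "source_doc": "https://example.com/algebra-notes",  # illustrative URL
    "question": "If 3x + 5 = 20, what is x?",
    "answer": "Subtract 5 from both sides: 3x = 15, so x = 5.",
}

def is_complete(s):
    """A sample is usable only if both question and answer are non-empty."""
    return bool(s.get("question")) and bool(s.get("answer"))
```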

SynthLLM builds on this potential with a three-stage process:

1. Selecting high-quality web content related to the domain.

2. Generating prompts with open-source LLMs, using three complementary methods that progressively increase the variety of questions.

3. Producing answers for each prompt to create complete data samples.
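The three stages above can be sketched as a toy pipeline; the selection filter, prompt generator, and answerer below are stubs standing in for steps that a real system would delegate to an LLM:

```python
def select_documents(corpus, domain_keyword):
    """Stage 1: keep documents related to the target domain (toy filter)."""
    return [d for d in corpus if domain_keyword in d.lower()]

def generate_prompts(doc):
    """Stage 2: derive question prompts from a document (stubbed)."""
    return [f"Question about: {doc}"]

def answer(prompt):
    """Stage 3: produce an answer for each prompt (stubbed)."""
    return f"Answer to: {prompt}"

def synthesize(corpus, domain_keyword):
    """Run all three stages and emit complete question-answer samples."""
    samples = []
    for doc in select_documents(corpus, domain_keyword):
        for prompt in generate_prompts(doc):
            samples.append({"question": prompt, "answer": answer(prompt)})
    return samples

data = synthesize(["Notes on calculus limits", "A cooking recipe"], "calculus")
```

The off-domain document is filtered out in stage 1, so only the calculus notes yield a sample.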

Unlike earlier methods that depend on back-translation or simple question extraction, SynthLLM uses graph algorithms to identify and recombine high-level concepts from multiple documents. This allows it to form deeper conceptual connections and to make more efficient use of limited reference documents when scaling up question diversity.
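One way to picture cross-document concept recombination, using a toy keyword lookup in place of the paper's actual graph algorithm, is to extract concepts per document and then pair only concepts drawn from different documents, so each pair can seed a question no single source would suggest:

```python
from itertools import combinations

# Toy concept vocabulary; a real system would extract concepts with an LLM.
CONCEPTS = {"gradient", "derivative", "probability", "matrix"}

def extract_concepts(doc):
    """Toy extraction: intersect document words with a known concept set."""
    return {w.strip(".,").lower() for w in doc.split()} & CONCEPTS

def cross_doc_pairs(docs):
    """Concept pairs whose members come from different documents."""
    tagged = [(i, c) for i, d in enumerate(docs) for c in extract_concepts(d)]
    return {
        tuple(sorted((c1, c2)))
        for (i, c1), (j, c2) in combinations(tagged, 2)
        if i != j and c1 != c2
    }

pairs = cross_doc_pairs([
    "The derivative gives the gradient.",
    "A probability matrix has rows summing to one.",
])
```

Same-document pairs are deliberately excluded, since the point of the recombination step is to connect concepts that never co-occur in one source.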

The result? A broader range of synthetic questions. Figure 2 shows that SynthLLM's multi-step process leads to greater diversity.

      \"chart,
      Figure 2. Bar chart showing question similarity within the same document<\/figcaption><\/figure>\n\n\n\n
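One simple, illustrative way to quantify the question diversity that Figure 2 reports (the study may use a different similarity metric) is average pairwise word-overlap similarity, where a lower average means more diverse questions:

```python
from itertools import combinations

def jaccard(q1, q2):
    """Word-set overlap between two questions, in [0, 1]."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    return len(a & b) / len(a | b)

def avg_pairwise_similarity(questions):
    """Mean Jaccard similarity over all question pairs; lower = more diverse."""
    sims = [jaccard(a, b) for a, b in combinations(questions, 2)]
    return sum(sims) / len(sims)

near_duplicates = ["what is a derivative", "what is a derivative of x"]
varied = ["what is a derivative", "how do matrices multiply"]
```

A generator that keeps rephrasing one seed question scores high on this measure, while one that draws on many concepts scores low.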

In direct comparisons with existing methods for expanding training data, SynthLLM's knowledge-guided approach makes better use of limited source material to generate high-quality questions. This leads to stronger model performance across benchmarks, as illustrated in Figure 3.

      \"chart,
      Figure 3. (a) Performance of other data expansion methods on the MATH benchmark. (b) Average model performance across multiple benchmarks (x-axis: sample ID; y-axis: accuracy)<\/figcaption><\/figure>\n\n\n\n

A renewable resource for training data

As the data wall continues to rise, synthetic data is emerging as an essential resource for AI development. It's scalable, fast to produce, cost-effective, and requires no manual labeling, making it a practical solution to growing data shortages.

Its value spans a range of fields. In healthcare, it protects patient privacy. In autonomous driving, it powers virtual simulations. In education, it enables the creation of millions of math problems on demand.

SynthLLM simplifies the generation of synthetic data, helping researchers make better use of it to train LLMs. The framework is adaptable to domains like code generation, physics, chemistry, and healthcare, broadening its potential to support research across disciplines.

Researchers are now working to further improve SynthLLM's efficiency and to explore its use in continued pretraining. These efforts aim to raise the quality and broaden the impact of synthetic data, helping drive the next wave of AI development.
