Figure 3. (a) Performance of other data expansion methods on the MATH benchmark. (b) Average model performance across multiple benchmarks (x-axis: sample ID; y-axis: accuracy)<\/figcaption><\/figure>\n\n\n\nA renewable resource for training data<\/h2>\n\n\n\n
As the data wall continues to rise, synthetic data is emerging as an essential resource for AI development. It\u2019s scalable, fast to produce, cost-effective, and doesn\u2019t require manual labeling\u2014making it a practical solution to growing data shortages.<\/p>\n\n\n\n
Its value spans a range of fields. In healthcare, it protects patient privacy. In autonomous driving, it powers virtual simulations. In education, it enables the creation of millions of math problems on demand.<\/p>\n\n\n\n
SynthLLM simplifies the generation of synthetic data, helping researchers make better use of it to train LLMs. The framework is adaptable to domains like code generation, physics, chemistry, and healthcare\u2014broadening its potential to support research across disciplines.<\/p>\n\n\n\n
Researchers are now working to further improve SynthLLM\u2019s efficiency and explore its use in continued pretraining. These efforts aim to raise the quality and broaden the impact of synthetic data, helping drive the next wave of AI development.<\/p>\n","protected":false},"excerpt":{"rendered":"
One of the driving forces behind AI\u2019s rapid progress is access to large-scale, high-quality data, essential to enable training models to continuously improve and perform reliably. But that well is running dry. As the supply of usable internet data shrinks, it\u2019s becoming harder and more expensive to gather the kind of training data AI needs. […]<\/p>\n","protected":false},"author":34512,"featured_media":1138770,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-content-parent":199560,"msr_hide_image_in_river":null,"footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-post-option":[269148,269142],"class_list":["post-1143984","msr-blog-post","type-msr-blog-post","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-include-in-river"],"msr_assoc_parent":{"id":199560,"type":"lab"},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1143984","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-blog-post"}],"author":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/users\/34512"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1143984\/revisions"}],"predecessor-version":[{"id":1143988,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-blog-post\/1143984\/revisions\/1143988"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1138770"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/r
esearch\/wp-json\/wp\/v2\/media?parent=1143984"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1143984"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1143984"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1143984"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}