{"id":1063530,"date":"2024-11-14T09:00:00","date_gmt":"2024-11-14T17:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1063530"},"modified":"2024-11-15T07:52:28","modified_gmt":"2024-11-15T15:52:28","slug":"orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/orca-agentinstruct-agentic-flows-can-be-effective-synthetic-data-generators\/","title":{"rendered":"Orca-AgentInstruct: Agentic flows can be effective synthetic-data generators"},"content":{"rendered":"\n
\"Orca-3<\/figure>\n\n\n\n

Our work on Orca<\/a> and Orca 2<\/a> demonstrated the power of using synthetic data to post-train small language models, bringing them to levels of performance previously found only in much larger models. Orca-AgentInstruct is another step in this direction: an agentic solution for synthetic-data generation that uses agentic flows to produce diverse, high-quality data at scale. By leveraging an agentic framework, AgentInstruct can generate tailored datasets, comprising both prompts and responses, from raw data sources, paving the way toward a synthetic-data factory for model fine-tuning. <\/p>\n\n\n\n

The efficacy of this approach is exemplified by the substantial improvements observed when fine-tuning a base Mistral 7-billion-parameter model on a 25-million-pair dataset generated with AgentInstruct. The fine-tuned model (which we refer to as Orca-3-Mistral) shows notable gains across multiple benchmarks: a 40% improvement on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, 45% on AlpacaEval, and a 31.34% reduction in inaccurate or unreliable results across multiple summarization benchmarks.<\/p>\n\n\n\n

We are making a 1-million-pair subset (orca-agentinstruct-1M<\/a>) of this dataset publicly available, along with a report<\/a> describing the data-generation procedure, to encourage research on synthetic-data generation and the fine-tuning of language models. <\/p>\n\n\n\n

\"Bar
Figure 1: Effect of using AgentInstruct for post-training Mistral-7B. <\/figcaption><\/figure>\n\n\n\n
\"The
Figure 2: A thematic overview of the roles played by different groups of agents. The Content Transformation Flow converts the seed into an intermediate representation that makes it easier to create high-quality and diverse data. The Seed Instruction Generation Flow creates instances of the target tasks following a taxonomy. The Instruction Refinement Flow explores the space further, starting from these initial data points and generating variants in their neighborhood. The expectation is that, by picking random seeds, we will be able to cover the entire region of data points. <\/figcaption><\/figure>\n\n\n\n
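The three flows in Figure 2 can be read as a simple pipeline: transform a raw seed into an intermediate representation, fan it out into one seed instruction per task in a taxonomy, then refine each instruction into nearby variants. The sketch below illustrates only that pipeline structure in plain Python; all function names and data shapes are hypothetical, and each stub stands in for what would be one or more LLM-backed agents in the actual AgentInstruct system.

```python
# Hypothetical sketch of the three AgentInstruct flows chained together.
# Each function is a stub; in the real system, LLM-backed agents do the work.

def content_transformation_flow(seed_text):
    """Convert a raw seed into an intermediate representation that is
    easier to generate high-quality, diverse tasks from."""
    # Stub: a real agent would rewrite the seed into a richer form
    # (e.g., an argument passage, a table, or a dialogue).
    return {"intermediate": seed_text.strip()}

def seed_instruction_generation_flow(intermediate, taxonomy):
    """Create one initial instruction per task type in the taxonomy."""
    return [
        {"task": task,
         "instruction": f"[{task}] based on: {intermediate['intermediate']}"}
        for task in taxonomy
    ]

def instruction_refinement_flow(instructions, variants_per_seed=2):
    """Explore the neighborhood of each initial instruction by producing
    modified variants (e.g., harder, more constrained, or restyled)."""
    refined = []
    for item in instructions:
        for i in range(variants_per_seed):
            refined.append({**item, "variant": i})
    return refined

def agentinstruct_pipeline(seed_text, taxonomy):
    """Chain the three flows: transform, generate seeds, refine."""
    intermediate = content_transformation_flow(seed_text)
    seeds = seed_instruction_generation_flow(intermediate, taxonomy)
    return instruction_refinement_flow(seeds)

pairs = agentinstruct_pipeline(
    "Photosynthesis converts light energy into chemical energy.",
    taxonomy=["reading-comprehension", "question-answering"],
)
print(len(pairs))  # 2 task types x 2 variants = 4 candidate data points
```

The fan-out at each stage is the point of the design: a single raw seed yields many distinct instruction variants, which is how random seeds can cover a broad region of the target data space.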

Synthetic Data Accelerated LLM Development:<\/strong> Over the past year, the use of synthetic data has greatly advanced the training of large language models (LLMs). It has sped up model training at all stages, from pre-training (e.g., Phi-3) to instruction-tuning (e.g., Orca and WizardLM) and reinforcement learning from human feedback (e.g., Direct Nash Optimization).<\/span> <\/p>\n\n\n\n

\n\t