{"id":1144422,"date":"2025-07-23T09:00:00","date_gmt":"2025-07-23T16:00:00","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=1144422"},"modified":"2025-07-31T08:57:03","modified_gmt":"2025-07-31T15:57:03","slug":"technical-approach-for-classifying-human-ai-interactions-at-scale","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/technical-approach-for-classifying-human-ai-interactions-at-scale\/","title":{"rendered":"Technical approach for classifying human-AI interactions at scale"},"content":{"rendered":"\n

As large language models (LLMs) become foundational to modern AI systems, the ability to run them at scale\u2014efficiently, reliably, and in near real-time\u2014is no longer a nice-to-have. It\u2019s essential. The Semantic Telemetry<\/a> project tackles this challenge by applying LLM-based classifiers to hundreds of millions of sampled, anonymized Bing Chat conversations each week. These classifiers extract signals like user expertise, primary topic, and satisfaction, enabling deeper insight into human-AI interactions and driving continuous system improvement.<\/p>\n\n\n\n
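To make the idea of an LLM-based classifier concrete, here is a minimal sketch of one classification pass: a prompt template that asks the model for structured labels, and a parser that validates the reply. The prompt wording, label taxonomy, and function names are illustrative assumptions, not the Semantic Telemetry project's actual prompts or schema.

```python
import json

# Hypothetical prompt for a single classifier pass. The real Semantic
# Telemetry prompts and label taxonomies are not public.
CLASSIFIER_PROMPT = (
    "Classify the following anonymized conversation.\n"
    "Return JSON with keys: expertise (novice|intermediate|expert), "
    "topic (short string), satisfaction (satisfied|unsatisfied|unclear).\n\n"
    "Conversation:\n{conversation}"
)

def parse_classification(raw_response: str) -> dict:
    """Parse the model's JSON reply, falling back to 'unclear' labels.

    LLM output is not guaranteed to be valid JSON, so a defensive parse
    keeps one malformed reply from failing a whole batch.
    """
    expected_keys = ("expertise", "topic", "satisfaction")
    try:
        labels = json.loads(raw_response)
    except json.JSONDecodeError:
        return {key: "unclear" for key in expected_keys}
    # Keep only the expected keys; fill any the model omitted.
    return {key: labels.get(key, "unclear") for key in expected_keys}
```

At scale, the defensive parse matters as much as the prompt: across hundreds of millions of conversations, even a small rate of malformed model replies would otherwise surface as pipeline failures.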

But building a pipeline that can handle this volume isn\u2019t just about plugging into an API. It requires a high-throughput, high-performance architecture that can orchestrate distributed processing, manage token and prompt complexity, and gracefully handle the unpredictability of remote LLM endpoints.<\/p>\n\n\n\n
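One common way to handle the unpredictability of remote LLM endpoints is retrying transient failures with exponential backoff and jitter. The sketch below is a generic pattern under assumed names (`call_with_backoff`, `send_request`), not the pipeline's actual retry policy.

```python
import random
import time

def call_with_backoff(send_request, max_retries=5, base_delay=1.0):
    """Retry a flaky remote call with exponential backoff and jitter.

    `send_request` is any zero-argument callable that raises on transient
    failure (timeouts, throttling). Illustrative sketch only.
    """
    for attempt in range(max_retries):
        try:
            return send_request()
        except Exception:
            if attempt == max_retries - 1:
                raise  # Exhausted retries; surface the failure upstream.
            # Doubling the delay each attempt, plus random jitter,
            # spreads out retry storms against a throttled endpoint.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

In practice a pipeline at this volume would also distinguish retryable errors (429s, timeouts) from permanent ones (malformed requests), but the backoff skeleton is the same.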

In this latest post in our series on Semantic Telemetry, we\u2019ll walk through the engineering behind that system\u2014how we designed for scale from the start, the trade-offs we made, and the lessons we learned along the way. From batching strategies to token optimization and orchestration, we\u2019ll share what it takes to build a real-time LLM classification pipeline.<\/p>\n\n\n\n
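As a taste of the batching discussion, one simple strategy is to pack conversations greedily into batches under a token budget, so each request to the model stays within context and throughput limits. The function name and greedy policy here are an illustrative assumption; the post's actual batching strategy may differ.

```python
def batch_by_budget(items, token_counts, max_tokens_per_batch):
    """Greedily pack items into batches under a token budget.

    `items` and `token_counts` are parallel sequences; a batch is flushed
    once adding the next item would exceed the budget. An oversized single
    item still gets its own batch rather than being dropped.
    """
    batches, current, used = [], [], 0
    for item, tokens in zip(items, token_counts):
        if current and used + tokens > max_tokens_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(item)
        used += tokens
    if current:
        batches.append(current)
    return batches
```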

For additional project background, see Semantic Telemetry: Understanding how users interact with AI systems<\/a> and Engagement, user expertise, and satisfaction: Key insights from the Semantic Telemetry Project<\/a>.<\/p>\n\n\n\n


System architecture highlights<\/h2>\n\n\n\n

The Semantic Telemetry pipeline (opens in new tab)<\/span><\/a> is a highly scalable, highly configurable data transformation pipeline. While it follows a familiar ETL structure, several architectural innovations make it uniquely suited for high-throughput LLM integration:<\/p>\n\n\n\n