{"id":1150284,"date":"2025-10-22T14:31:38","date_gmt":"2025-10-22T21:31:38","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=1150284"},"modified":"2025-10-22T14:31:41","modified_gmt":"2025-10-22T21:31:41","slug":"kernel%e2%80%91level-innovation-and-hardware%e2%80%91aware-modeling","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/kernel%e2%80%91level-innovation-and-hardware%e2%80%91aware-modeling\/","title":{"rendered":"Kernel\u2011level innovation and hardware\u2011aware modeling\u00a0"},"content":{"rendered":"<section class=\"mb-3 moray-highlight\">\n\t<div class=\"card-img-overlay mx-lg-0\">\n\t\t<div class=\"card-background  has-background-catalina-blue card-background--full-bleed\">\n\t\t\t<img loading=\"lazy\" decoding=\"async\" width=\"1920\" height=\"720\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720.jpg\" class=\"attachment-full size-full\" alt=\"M365 Research banner: network of connected points\" style=\"\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720.jpg 1920w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-300x113.jpg 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-1024x384.jpg 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-768x288.jpg 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-1536x576.jpg 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-1600x600.jpg 1600w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2024\/06\/M365-Research-Page-Banner_1920x720-240x90.jpg 240w\" sizes=\"auto, (max-width: 1920px) 100vw, 1920px\" \/>\t\t<\/div>\n\t\t<!-- Foreground -->\n\t\t<div class=\"card-foreground d-flex mt-md-n5 my-lg-5 px-g px-lg-0\">\n\t\t\t<!-- Container -->\n\t\t\t<div class=\"container d-flex mt-md-n5 my-lg-5 \">\n\t\t\t\t<!-- Card wrapper -->\n\t\t\t\t<div class=\"w-100 \">\n\t\t\t\t\t<!-- Card -->\n\t\t\t\t\t<div class=\"card material-md-card py-5 px-md-5\">\n\t\t\t\t\t\t<div class=\"card-body \">\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<a href=\"https:\/\/www.microsoft.com\/en-us\/research\/?p=1145968&post_type=msr-group&preview=1&_ppp=04617fa63e\" class=\"icon-link icon-link--reverse mb-2\" data-bi-cN=\"Efficient AI team\">\n\t\t\t\t\t\t\t\t\t<span class=\"c-glyph glyph-chevron-left\" aria-hidden=\"true\"><\/span>\n\t\t\t\t\t\t\t\t\tEfficient AI team\t\t\t\t\t\t\t\t<\/a>\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\n\n<h1 class=\"wp-block-heading\" id=\"kernel-level-innovation-and-hardware-aware-modeling\">Kernel\u2011level innovation and hardware\u2011aware modeling<\/h1>\n\n\t\t\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\t\t<\/div>\n\t\t\t<\/div>\n\t\t<\/div>\n\t<\/div>\n<\/section>\n\n\n\n\n\n<p class=\"wp-block-paragraph\">Our team is driving fundamental innovation at the kernel level to push the boundaries of efficiency in large-scale AI workloads. We are rethinking core attention mechanisms and computational pathways to deliver breakthroughs in performance, memory optimization, and scalability.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By redesigning execution flows, introducing advanced quantization strategies, and leveraging emerging hardware capabilities, we aim to eliminate bottlenecks in both compute and communication layers. This project focuses on achieving end-to-end acceleration without compromising accuracy or reliability, enabling models to handle longer contexts and higher throughput at significantly lower cost. Through tight algorithm\u2013hardware co-design and deep integration with production systems, we are building the foundation for next-generation AI infrastructure that is faster, leaner, and more sustainable.<\/p>\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"735\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/image-9-1024x735.png\" alt=\"Research At A Glance: Hardware-Software Codesign for Efficient AI\" class=\"wp-image-1151757\" style=\"width:662px;height:auto\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/image-9-1024x735.png 1024w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/image-9-300x215.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/image-9-768x551.png 768w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/image-9-1536x1102.png 1536w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/image-9-240x172.png 240w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/10\/image-9.png 1863w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<div style=\"height:20px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Together, these advancements deliver measurable gains in tokens per second, cost per generated token, and energy efficiency while preserving output quality.<\/p>\n\n\n\n<div style=\"height:25px\" aria-hidden=\"true\" class=\"wp-block-spacer\"><\/div>\n\n\n","protected":false},"excerpt":{"rendered":"<p>We design and optimize GPU kernels and model\u2011execution strategies to maximize throughput and minimize latency for real\u2011world LLM workloads. Interactive enterprise scenarios often run at low batch sizes, interleave very long contexts, and have strict latency targets\u2014exposing different bottlenecks than training. <\/p>\n<p>Our work includes attention\u2011kernel optimization for both prefill and decode, sampling and logit\u2011processing improvements, and auto\u2011tuning at the PTX level to balance occupancy, register usage, and memory traffic. We also explore dynamic kernel selection at runtime, choosing kernels based on batch size, context length, and hardware topology to maintain peak efficiency without manual retuning. <\/p>\n<p>Together, these advancements deliver measurable gains in tokens per second and cost per generated token while preserving output quality. <\/p>\n","protected":false},"featured_media":1045266,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":true,"_classifai_error":"","footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1150284","msr-project","type-msr-project","status-publish","has-post-thumbnail","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"","related-publications":[841759,1041231,1041954,1041966,1129848],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"related-researchers":[{"type":"user_nicename","display_name":"Srikant Bharadwaj","user_id":41644,"people_section":"Related people","alias":"srbharadwaj"},{"type":"user_nicename","display_name":"Mirian Hipolito Garcia","user_id":40483,"people_section":"Related people","alias":"mirianh"},{"type":"user_nicename","display_name":"Daniel Eduardo Madrigal Diaz","user_id":40480,"people_section":"Related people","alias":"danielmad"},{"type":"user_nicename","display_name":"Victor Ruehle","user_id":41027,"people_section":"Related people","alias":"virueh"},{"type":"user_nicename","display_name":"Renee St. Amant","user_id":43080,"people_section":"Related people","alias":"reneestamant"}],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1150284","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":21,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1150284\/revisions"}],"predecessor-version":[{"id":1155057,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/1150284\/revisions\/1155057"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media\/1045266"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1150284"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1150284"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1150284"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1150284"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1150284"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}