{"id":472845,"date":"2018-03-20T10:20:05","date_gmt":"2018-03-20T17:20:05","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=472845"},"modified":"2024-04-24T08:19:12","modified_gmt":"2024-04-24T15:19:12","slug":"fiddle","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/fiddle\/","title":{"rendered":"Project Fiddle"},"content":{"rendered":"

Project Fiddle: Fast and Efficient Infrastructure for Distributed Deep Learning

The goal of Project Fiddle is to build efficient systems infrastructure for very fast distributed DNN training; specifically, we aim to make training 100x more efficient. To achieve this, we take a broad view of training: from a single GPU, to multiple GPUs on a machine, all the way to training on large multi-machine clusters. Our innovations cut across the systems stack: the memory subsystem, the structuring of parallel computation across GPUs and machines, and the interconnects between GPUs and across machines.
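To ground what "structuring parallel computation across GPUs and machines" looks like in practice, here is a minimal sketch of data-parallel training using PyTorch's DistributedDataParallel. This is an illustrative assumption, not Project Fiddle's own code: the framework choice, toy model, and hyperparameters are all placeholders.

```python
# Minimal sketch of multi-GPU data-parallel DNN training with PyTorch
# DistributedDataParallel (DDP). Illustrative only -- not Project Fiddle code.
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A toy model; in practice this would be a full DNN.
    model = nn.Linear(1024, 10).cuda(local_rank)
    # DDP keeps one replica per GPU and all-reduces gradients every step,
    # so the same script scales from one machine to a multi-machine cluster.
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        # Each rank would normally read a distinct shard of the dataset
        # (e.g., via DistributedSampler); random tensors stand in here.
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)

        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()  # gradient all-reduce across GPUs happens here
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Launched with, for example, `torchrun --nproc_per_node=<num_gpus> train.py`, the same script runs on one GPU, several GPUs in a machine, or many machines; the cost of the gradient synchronization step is exactly where the memory-subsystem and interconnect work described above matters.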

Our work so far has targeted many different parts of the systems stack, organized as the following sub-projects: