{"id":826999,"date":"2022-03-16T09:27:57","date_gmt":"2022-03-16T16:27:57","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=826999"},"modified":"2022-04-07T03:24:52","modified_gmt":"2022-04-07T10:24:52","slug":"varuna-scalable-low-cost-training-of-massive-deep-learning-models","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/varuna-scalable-low-cost-training-of-massive-deep-learning-models\/","title":{"rendered":"Varuna: Scalable, Low-cost Training of Massive Deep Learning Models"},"content":{"rendered":"
Systems for training massive deep learning models (billions of parameters) today assume and require specialized “hyper-clusters”: hundreds or thousands of GPUs wired with specialized high-bandwidth interconnects such as NV-Link and Infiniband. Besides being expensive, such dependence on hyper-clusters and custom high-speed inter-connects limits the size of such clusters, creating (a) scalability limits on job parallelism; (b) resource fragmentation across hyper-clusters. In this paper, we present Varuna, a new system that enables training massive deep learning models on commodity networking. Varuna makes thrifty use of networking resources and automatically configures the user’s training job to efficiently use any given set of resources. Therefore, Varuna is able to leverage “low-priority” VMs that cost about 5x cheaper than dedicated GPUs, thus significantly reducing the cost of training massive models. We demonstrate the efficacy of Varuna by training massive models, including a 200 billion parameter model, on 5x cheaper “spot VMs”, while maintaining high training throughput. Varuna improves end-to-end training time by up to 18x compared to other model-parallel approaches and up to 26% compared to other pipeline parallel approaches. The code for Varuna is available at https:\/\/github.com\/microsoft\/varuna (opens in new tab)<\/span><\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":" Systems for training massive deep learning models (billions of parameters) today assume and require specialized “hyper-clusters”: hundreds or thousands of GPUs wired with specialized high-bandwidth interconnects such as NV-Link and Infiniband. Besides being expensive, such dependence on hyper-clusters and custom high-speed inter-connects limits the size of such clusters, creating (a) scalability limits on job parallelism; […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[246574],"research-area":[13547],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[246694,249193,246691,246658,246793,249232,263779,256402,258988,249241,257725],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-826999","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-highlight-award","msr-research-area-systems-and-networking","msr-locale-en_us","msr-field-of-study-artificial-intelligence","msr-field-of-study-code-cryptography","msr-field-of-study-computer-science","msr-field-of-study-deep-learning","msr-field-of-study-distributed-computing","msr-field-of-study-infiniband","msr-field-of-study-market-fragmentation","msr-field-of-study-pipeline-software","msr-field-of-study-resource-project-management","msr-field-of-study-scalability","msr-field-of-study-throughput-business"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2022-4-22","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"Best Paper Award","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/dl.acm.org\/doi\/pdf\/10.1145\/3492321.3519584","label_id":"243109","label":0},{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2022\/03\/varuna-eurosys22.pdf","id":"833485","title":"varuna-eurosys22","label_id":"243132","label":0}],"msr_related_uploader":"","msr_attachments":[{"id":833485,"url":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2022\/04\/varuna-eurosys22.pdf"}],"msr-author-ordering":[{"type":"guest","value":"sanjith-athlur","user_id":827002,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=sanjith-athlur"},{"type":"guest","value":"nitika-saran","user_id":827005,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=nitika-saran"},{"type":"user_nicename","value":"Muthian Sivathanu","user_id":36320,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Muthian Sivathanu"},{"type":"user_nicename","value":"Ramachandran Ramjee","user_id":33337,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Ramachandran Ramjee"},{"type":"user_nicename","value":"Nipun Kwatra","user_id":37634,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Nipun Kwatra"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[826969],"msr_group":[],"msr_project":[968667],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":968667,"post_title":"AI Infrastructure","post_name":"ai-infrastructure","post_type":"msr-project","post_date":"2023-10-04 00:44:27","post_modified":"2024-08-01 03:26:39","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/ai-infrastructure\/","post_excerpt":"Towards efficient AI\/ML deployment The AI Infrastructure team at Microsoft Research India works on cutting-edge systems optimizations for improving the efficiency of a variety of AI\/ML workloads, including an emerging class of workloads, namely, serving large language models (LLMs). AI\/ML models are expensive to train and serve at scale and therefore, systems optimizations are crucial for unlocking the true potential of AI-powered applications. The key principle behind many of our projects is co-design of Systems and…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/968667"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/826999"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":2,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/826999\/revisions"}],"predecessor-version":[{"id":833488,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/826999\/revisions\/833488"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=826999"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=826999"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=826999"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=826999"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=826999"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=826999"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=826999"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=826999"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=826999"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=826999"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=826999"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=826999"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=826999"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=826999"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=826999"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=826999"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}