{"id":557136,"date":"2018-12-14T03:29:20","date_gmt":"2018-12-14T11:29:20","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=557136"},"modified":"2019-11-19T09:29:06","modified_gmt":"2019-11-19T17:29:06","slug":"gandiva-scheduler-for-dnns","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/gandiva-scheduler-for-dnns\/","title":{"rendered":"Gandiva: Scheduler for DNNs"},"content":{"rendered":"

Gandiva is a cluster scheduling framework that utilizes domain-specific knowledge of deep learning to improve the efficiency of training deep learning models in a GPU cluster. By co-designing the cluster scheduler and the deep learning framework (e.g., PyTorch), Gandiva can communicate richer information and exercise finer control between the two layers, enabling better scheduling.<\/p>\n

The two key requirements of a scheduler for deep learning jobs are (a) low-latency feedback (to enable fast iteration during hyper-parameter search or AutoML) and (b) high resource efficiency (to manage cost). Gandiva achieves these twin goals by exploiting a key characteristic of deep learning: <em>intra-job predictability<\/em>. Deep learning training jobs perform numerous repetitive iterations called mini-batches, with each mini-batch being nearly identical to the others in terms of resource usage. Gandiva exploits this intra-job predictability to time-slice GPUs efficiently across multiple jobs, thereby delivering low latency. This predictability is also used to introspect job performance and dynamically migrate jobs to better-fit GPUs, thereby improving cluster efficiency. Knowledge of a job's internal characteristics (such as mini-batch boundaries) enables the Gandiva scheduler to perform application-aware profiling: for example, migration decisions are based on actual <i>useful<\/i> application throughput, rather than black-box metrics such as utilization that conflate useful work with overhead due to interference.<\/p>\n","protected":false},"excerpt":{"rendered":"

Gandiva is a cluster scheduling framework that utilizes domain-specific knowledge of deep learning to improve the efficiency of training deep learning models in a GPU cluster.<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"research-area":[13547],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-557136","msr-project","type-msr-project","status-publish","hentry","msr-research-area-systems-and-networking","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"2017-10-01","related-publications":[559971],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Ramachandran Ramjee","user_id":33337,"people_section":"Section name 1","alias":"ramjee"},{"type":"user_nicename","display_name":"Nipun Kwatra","user_id":37634,"people_section":"Section name 1","alias":"nkwatra"},{"type":"user_nicename","display_name":"Fan Yang","user_id":31782,"people_section":"Section name 1","alias":"fanyang"},{"type":"user_nicename","display_name":"Lidong Zhou","user_id":32673,"people_section":"Section name 
1","alias":"lidongz"}],"msr_research_lab":[199560,199562],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/557136"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":7,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/557136\/revisions"}],"predecessor-version":[{"id":621936,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/557136\/revisions\/621936"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=557136"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=557136"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=557136"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=557136"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=557136"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}