{"id":731227,"date":"2021-03-05T06:36:52","date_gmt":"2021-03-05T14:36:52","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=731227"},"modified":"2023-11-13T23:10:21","modified_gmt":"2023-11-14T07:10:21","slug":"estimating-gpu-memory-consumption-of-deep-learning-models","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/estimating-gpu-memory-consumption-of-deep-learning-models\/","title":{"rendered":"Estimating GPU Memory Consumption of Deep Learning Models"},"content":{"rendered":"<p>Deep learning (DL) has been increasingly adopted by a variety of software-intensive systems. Developers mainly use GPUs to accelerate the training, testing, and deployment of DL models. However, the GPU memory consumed by a DL model is often unknown to them before a DL job is executed. Therefore, an improper choice of neural network architecture or hyperparameters can cause the DL job to run out of the limited GPU memory and fail. Our empirical study has found that many DL job failures are due to the exhaustion of GPU memory. This leads to a horrendous waste of computing resources and a significant reduction in development productivity. In this paper, we propose DNNMem, an accurate estimation tool for GPU memory consumption of DL models. DNNMem employs an analytic estimation approach to systematically calculate the memory consumption of both the computation graph and the DL framework runtime. We have evaluated DNNMem on 5 real-world representative models with different hyperparameters under 3 mainstream frameworks (TensorFlow, PyTorch, and MXNet). Our extensive experiments show that DNNMem is effective in estimating GPU memory consumption.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Deep learning (DL) has been increasingly adopted by a variety of software-intensive systems. Developers mainly use GPUs to accelerate the training, testing, and deployment of DL models. However, the GPU memory consumed by a DL model is often unknown to them before a DL job is executed. Therefore, an improper choice of neural network architecture [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13556,13560,13547],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[253600,246694,253603,246691,246658,248569,253597,246685,253606,248224],"msr-conference":[259201],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-731227","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-programming-languages-software-engineering","msr-research-area-systems-and-networking","msr-locale-en_us","msr-field-of-study-accurate-estimation","msr-field-of-study-artificial-intelligence","msr-field-of-study-computation-graph","msr-field-of-study-computer-science","msr-field-of-study-deep-learning","msr-field-of-study-empirical-research","msr-field-of-study-hyperparameter","msr-field-of-study-machine-learning","msr-field-of-study-neural-network-architecture","msr-field-of-study-software-deployment"],"msr_publishername":"ACM","msr_edition":"","msr_affiliation":"","msr_published_date":"2020-11-7","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"The 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Industry Track","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/09\/dnnmem.pdf","id":"726073","title":"dnnmem","label_id":"243103","label":0},{"type":"doi","viewUrl":"false","id":"false","title":"10.1145\/3368089.3417050","label_id":"243106","label":0}],"msr_related_uploader":"","msr_attachments":[],"msr-author-ordering":[{"type":"user_nicename","value":"Yanjie Gao","user_id":34966,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Yanjie Gao"},{"type":"text","value":"Yu Liu","user_id":0,"rest_url":false},{"type":"text","value":"Hongyu Zhang","user_id":0,"rest_url":false},{"type":"text","value":"Zhengxian Li","user_id":0,"rest_url":false},{"type":"text","value":"Yonghao Zhu","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Haoxiang Lin","user_id":31972,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Haoxiang Lin"},{"type":"user_nicename","value":"Mao Yang","user_id":32798,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Mao Yang"}],"msr_impact_theme":[],"msr_research_lab":[199560],"msr_event":[],"msr_group":[510017,920469],"msr_project":[809443],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":809443,"post_title":"AI Tooling and MLOps","post_name":"ai-tooling-and-mlops","post_type":"msr-project","post_date":"2022-01-06 18:35:35","post_modified":"2023-08-12 19:16:12","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/ai-tooling-and-mlops\/","post_excerpt":"In recent years, artificial intelligence (AI), including machine learning (ML) and deep learning (DL), has been widely adopted in many application domains, such as computer vision, speech recognition, natural language processing, and gaming. However, developers currently rely on traditional paradigms for AI development and operation, which causes significant job failures, runtime performance degradation, information breach, etc. and slows down development productivity severely. We adopt technologies from the areas of Systems, Programming Languages, and Software Engineering&hellip;","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/809443"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/731227","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":4,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/731227\/revisions"}],"predecessor-version":[{"id":734647,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/731227\/revisions\/734647"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=731227"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=731227"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=731227"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=731227"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=731227"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=731227"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=731227"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=731227"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=731227"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=731227"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=731227"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=731227"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=731227"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=731227"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=731227"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=731227"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}