{"id":1137297,"date":"2025-04-22T07:36:17","date_gmt":"2025-04-22T14:36:17","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=1137297"},"modified":"2025-09-08T07:50:51","modified_gmt":"2025-09-08T14:50:51","slug":"an-empirical-study-of-issues-in-large-language-model-training-systems","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/an-empirical-study-of-issues-in-large-language-model-training-systems\/","title":{"rendered":"An Empirical Study of Issues in Large Language Model Training Systems"},"content":{"rendered":"
Large language models (LLMs) have gained significant traction in recent years, driving advancements in various applications. The training and evaluation of these models depend heavily on specialized LLM training systems, which are deployed across numerous GPUs, partition LLMs, and process large datasets. However, issues in LLM training systems can lead to program crashes or unexpected behavior, reducing development productivity and wasting valuable resources such as GPUs and storage.<\/p>\n
This paper presents the first comprehensive empirical study of issues in LLM training systems. We conducted a manual analysis of 300 high-quality issue reports and corresponding fix commits from the GitHub repositories of three prominent LLM training systems: Microsoft DeepSpeed, NVIDIA Megatron-LM, and Hugging Face Transformers. Our analysis identified common symptoms, root causes, typical fixes, and debugging and testing practices in LLM training systems. Our major findings include: (1) LLM training systems exhibit issues and trends that are uncommon in traditional deep learning, such as Concurrency Error and Tensor Management Error occurring in parallel training, which are particularly difficult to diagnose and resolve. (2) The primary root causes of these issues are API Misuse (19.67%), Configuration Error (18.33%), and General Code Error (16.33%), respectively. Such issues often arise from the rapid evolution of the systems, the integration of complex external dependencies, and a configuration-driven development paradigm. (3) Current testing and debugging practices are often insufficient for identifying issues related to parallel training and large-scale numerical computations. Based on our findings, we propose several research topics and tooling improvements that can facilitate the future development of LLMs.<\/p>\n","protected":false},"excerpt":{"rendered":"
Large language models (LLMs) have gained significant traction in recent years, driving advancements in various applications. The training and evaluation of these models depend heavily on specialized LLM training systems, which are deployed across numerous GPUs, partition LLMs, and process large datasets. However, issues in LLM training systems can lead to program crashes or unexpected […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"ACM","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"FSE 2025","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2025-6-23","msr_highlight_text":"","msr_notes":"The ACM International Conference on the Foundations of Software Engineering, Industry Track","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"https:\/\/conf.researchr.org\/home\/fse-2025","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":false,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":null,"footnotes":""},"msr-research-highlight":[],"research-area":[13556,13560,13547],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[259201],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1137297","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-programming-languages-software-engineering","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_publishername":"ACM","msr_edition":"","msr_affiliation":"","msr_published_date":"2025-6-23","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"The ACM International Conference on the Foundations of Software Engineering, Industry Track","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":0,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/dl.acm.org\/doi\/abs\/10.1145\/3696630.3728538","label_id":"243109","label":0},{"type":"doi","viewUrl":"false","id":"false","title":"10.1145\/3696630.3728538","label_id":"243106","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[],"msr-author-ordering":[{"type":"user_nicename","value":"Yanjie Gao","user_id":34966,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Yanjie Gao"},{"type":"text","value":"Ruiming Lu","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Haoxiang Lin","user_id":31972,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Haoxiang Lin"},{"type":"text","value":"Yueguo Chen","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[199560],"msr_event":[],"msr_group":[510017,920469],"msr_project":[809443],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":809443,"post_title":"AI Tooling and MLOps","post_name":"ai-tooling-and-mlops","post_type":"msr-project","post_date":"2022-01-06 18:35:35","post_modified":"2023-08-12 19:16:12","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/ai-tooling-and-mlops\/","post_excerpt":"In recent years, artificial intelligence (AI), including machine learning (ML) and deep learning (DL), has been widely adopted in many application domains, such as computer vision, speech recognition, natural language processing, and gaming. However, developers currently rely on traditional paradigms for AI development and operation, which causes significant job failures, runtime performance degradation, information breach, etc. and slows down development productivity severely. We adopt technologies from the areas of Systems, Programming Languages, and Software Engineering…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/809443"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1137297","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":2,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1137297\/revisions"}],"predecessor-version":[{"id":1143371,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1137297\/revisions\/1143371"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1137297"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1137297"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1137297"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1137297"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=1137297"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1137297"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1137297"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1137297"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1137297"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1137297"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1137297"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1137297"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1137297"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}