{"id":946149,"date":"2023-06-06T15:23:15","date_gmt":"2023-06-06T22:23:15","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=946149"},"modified":"2023-06-14T14:45:23","modified_gmt":"2023-06-14T21:45:23","slug":"hyrax-fail-in-place-server-operation-in-cloud-platforms","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/hyrax-fail-in-place-server-operation-in-cloud-platforms\/","title":{"rendered":"Hyrax: Fail-in-Place Server Operation in Cloud Platforms"},"content":{"rendered":"

Today\u2019s cloud platforms handle server hardware failures by shutting down the affected server and only turning it back online once it has been repaired by a technician. At cloud scale, this all-or-nothing operating model is becoming increasingly unsustainable. This model is also at odds with technology trends, such as the need for new cooling technology. This paper introduces Hyrax, a datacenter stack that enables compute servers with failed components to continue hosting VMs while hiding the underlying degraded capacity and performance. A key enabler of Hyrax is a novel model of changes in memory interleaving when deactivating faulty memory modules. Experiments on cloud production servers show that Hyrax overcomes common hardware failures without impacting peak VM performance. In large-scale simulations with production traces, Hyrax reduces server repair requirements by 50-60% without impacting VM scheduling.<\/p>\n","protected":false},"excerpt":{"rendered":"

Today\u2019s cloud platforms handle server hardware failures by shutting down the affected server and only turning it back online once it has been repaired by a technician. At cloud scale, this all-or-nothing operating model is becoming increasingly unsustainable. This model is also at odds with technology trends, such as the need for new cooling technology.\u00a0 […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13547],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-946149","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2023-7-10","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"USENIX","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/06\/Fail_in_Place_OSDI23_public.pdf","id":"949092","title":"fail_in_place_o
sdi23_public","label_id":"243109","label":0}],"msr_related_uploader":"","msr_attachments":[{"id":949092,"url":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/06\/Fail_in_Place_OSDI23_public.pdf"},{"id":946158,"url":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/06\/Hyrax-OSDI23.pdf"}],"msr-author-ordering":[{"type":"text","value":"Jialun Lyu","user_id":0,"rest_url":false},{"type":"text","value":"Marisa You","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Celine Irvene","user_id":40636,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Celine Irvene"},{"type":"text","value":"Mark Jung","user_id":0,"rest_url":false},{"type":"text","value":"Tyler Narmore","user_id":0,"rest_url":false},{"type":"text","value":"Jacob Shapiro","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Luke Marshall","user_id":37386,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Luke Marshall"},{"type":"text","value":"Savyasachi Samal","user_id":0,"rest_url":false},{"type":"text","value":"Ioannis Manousakis","user_id":0,"rest_url":false},{"type":"text","value":"Lisa Hsu","user_id":0,"rest_url":false},{"type":"text","value":"Preetha Subbarayalu","user_id":0,"rest_url":false},{"type":"text","value":"Ashish Raniwala","user_id":0,"rest_url":false},{"type":"text","value":"Brijesh Warrier","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Ricardo Bianchini","user_id":33393,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Ricardo Bianchini"},{"type":"text","value":"Bianca Schroeder","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Daniel S. Berger","user_id":38892,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Daniel S. 
Berger"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[282170],"msr_project":[757045],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":757045,"post_title":"Zissou: New datacenter, server, and software architectures for liquid-cooled systems","post_name":"zissou-new-datacenter-server-and-software-architectures-for-liquid-cooled-systems","post_type":"msr-project","post_date":"2021-06-27 18:19:18","post_modified":"2024-08-29 09:47:19","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/zissou-new-datacenter-server-and-software-architectures-for-liquid-cooled-systems\/","post_excerpt":"The Zissou project is exploring immersion cooling in large-scale cloud platforms. Our main motivation is that chip power has been steadily increasing since the end of Dennard scaling.","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/757045"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/946149"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/946149\/revisions"}],"predecessor-version":[{"id":946170,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/946149\/revisions\/946170"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=946149"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=946149"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https
:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=946149"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=946149"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=946149"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=946149"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=946149"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=946149"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=946149"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=946149"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=946149"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=946149"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=946149"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=946149"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=946149"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr
-pillar?post=946149"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}