{"id":738673,"date":"2021-04-29T20:34:08","date_gmt":"2021-04-30T03:34:08","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=738673"},"modified":"2024-05-01T10:29:20","modified_gmt":"2024-05-01T17:29:20","slug":"flex-high-availability-datacenters-with-zero-reserved-power","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/flex-high-availability-datacenters-with-zero-reserved-power\/","title":{"rendered":"Flex: High-Availability Datacenters With Zero Reserved Power"},"content":{"rendered":"
Cloud providers, like Amazon and Microsoft, must guarantee high availability for a large fraction of their workloads. For this reason, they build datacenters with redundant infrastructures for power delivery and cooling. Typically, the redundant resources are reserved for use only during infrastructure failure or maintenance events, so that workload performance and availability do not suffer. Unfortunately, the reserved resources also produce lower power utilization and, consequently, require more datacenters to be built. To address these problems, in this paper we propose \u201czero-reserved-power\u201d datacenters and the Flex system to ensure that workloads still receive their desired performance and availability. Flex leverages the existence of software-redundant workloads that can tolerate lower infrastructure availability, while imposing minimal (if any) performance degradation for those that require high infrastructure availability. Flex mainly comprises (1) a new offline workload placement policy that reduces stranded power while ensuring safety during failure or maintenance events, and (2) a distributed system that monitors for failures and quickly reduces the power draw while respecting the workloads\u2019 requirements, when it detects a failure. Our evaluation shows that Flex produces less than 5% stranded power and increases the number of deployed servers by up to 33%, which translates to hundreds of millions of dollars in construction cost savings per datacenter site. We end the paper with lessons from our experience bringing Flex to production in Microsoft\u2019s datacenters.<\/p>\n
","protected":false},"excerpt":{"rendered":"
Cloud providers, like Amazon and Microsoft, must guarantee high availability for a large fraction of their workloads.\u00a0 For this reason, they build datacenters with redundant infrastructures for power delivery and cooling.\u00a0 Typically, the redundant resources are reserved for use only during infrastructure failure or maintenance events, so that workload performance and availability do not suffer.\u00a0 […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13547],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-738673","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2021-6-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/FlexMA-DCs-ISCA21.pdf","i
d":"743026","title":"flexma-dcs-isca21","label_id":"243109","label":0}],"msr_related_uploader":"","msr_attachments":[{"id":743026,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2021\/04\/FlexMA-DCs-ISCA21.pdf"}],"msr-author-ordering":[{"type":"user_nicename","value":"Chaojie Zhang","user_id":42705,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Chaojie Zhang"},{"type":"user_nicename","value":"Alok Kumbhare","user_id":36086,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Alok Kumbhare"},{"type":"text","value":"Ioannis Manousakis","user_id":0,"rest_url":false},{"type":"text","value":"Deli Zhang","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Pulkit Misra","user_id":38496,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Pulkit Misra"},{"type":"text","value":"Rod Assis","user_id":0,"rest_url":false},{"type":"text","value":"Kyle Woolcock","user_id":0,"rest_url":false},{"type":"text","value":"Nithish Mahalingam","user_id":0,"rest_url":false},{"type":"text","value":"Brijesh Warrier","user_id":0,"rest_url":false},{"type":"text","value":"David Gauthier","user_id":0,"rest_url":false},{"type":"text","value":"Lalu Kunnath","user_id":0,"rest_url":false},{"type":"text","value":"Steve Solomon","user_id":0,"rest_url":false},{"type":"text","value":"Osvaldo Morales","user_id":0,"rest_url":false},{"type":"text","value":"Marcus Fontoura","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Ricardo Bianchini","user_id":33393,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Ricardo 
Bianchini"}],"msr_impact_theme":[],"msr_research_lab":[199565],"msr_event":[],"msr_group":[144927,282170],"msr_project":[615975],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":615975,"post_title":"Power Efficiency and Sustainability","post_name":"power-capping","post_type":"msr-project","post_date":"2019-10-18 11:34:18","post_modified":"2025-02-05 11:04:52","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/power-capping\/","post_excerpt":"Power Capping and Oversubscription is a collaboration between MSR, Azure Compute, CO+I, and AHSI to harvest stranded datacenter resources via smart performance-aware power capping and oversubscription.","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/615975"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/738673","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":4,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/738673\/revisions"}],"predecessor-version":[{"id":1030143,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/738673\/revisions\/1030143"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=738673"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=738673"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=738673"},{"taxonomy":"msr-resea
rch-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=738673"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=738673"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=738673"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=738673"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=738673"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=738673"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=738673"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=738673"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=738673"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=738673"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=738673"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=738673"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=738673"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}