{"id":723682,"date":"2021-02-05T17:19:54","date_gmt":"2021-02-06T01:19:54","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=723682"},"modified":"2021-03-12T15:36:07","modified_gmt":"2021-03-12T23:36:07","slug":"selecting-subexpressions-to-materialize-at-datacenter-scale","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/selecting-subexpressions-to-materialize-at-datacenter-scale\/","title":{"rendered":"Selecting subexpressions to materialize at datacenter scale"},"content":{"rendered":"
We observe significant overlaps in the computations performed by user jobs in modern shared analytics clusters. Naively computing the same subexpressions multiple times results in wasting cluster resources and longer execution times. Given that these shared cluster workloads consist of tens of thousands of jobs, identifying overlapping computations across jobs is of great interest to both cluster operators and users. Nevertheless, existing approaches support orders of magnitude smaller workloads or employ heuristics with limited effectiveness. In this paper, we focus on the problem of subexpression selection for large workloads, i.e., selecting common parts of job plans and materializing them to speed-up the evaluation of subsequent jobs. We provide an ILP-based formulation of our problem and map it to a bipartite graph labeling problem. Then, we introduce BigSubs, a vertex-centric graph algorithm to iteratively choose in parallel which subexpressions to materialize and which subexpressions to use for evaluating each job. We provide a distributed implementation of our approach using our internal SQL-like execution framework, SCOPE, and assess its effectiveness over production workloads. BigSubs supports workloads with tens of thousands of jobs, yielding savings of up to 40% in machine-hours. We are currently integrating our techniques with the SCOPE runtime in our production clusters.<\/p>\n","protected":false},"excerpt":{"rendered":"
We observe significant overlaps in the computations performed by user jobs in modern shared analytics clusters. Naively computing the same subexpressions multiple times results in wasting cluster resources and longer execution times. Given that these shared cluster workloads consist of tens of thousands of jobs, identifying overlapping computations across jobs is of great interest to […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13563,13547],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[246784,251638,251641,250378,246691,251644,250657,248785,248230,248341],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-723682","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-data-platform-analytics","msr-research-area-systems-and-networking","msr-locale-en_us","msr-field-of-study-analytics","msr-field-of-study-bipartite-graph","msr-field-of-study-cluster-physics","msr-field-of-study-computation","msr-field-of-study-computer-science","msr-field-of-study-graph-algorithms","msr-field-of-study-heuristics","msr-field-of-study-labeling-problem","msr-field-of-study-operator-computer-programming","msr-field-of-study-theoretical-computer-science"],"msr_publishername":"VLDB Endowment","msr_edition":"","msr_affiliation":"","msr_published_date":"2018-2-28","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"doi","viewUrl":"false","id":"false","title":"10.14778\/3192965.3192971","label_id":"243106","label":0},{"type":"url","viewUrl":"false","id":"false","title":"http:\/\/www.vldb.org\/pvldb\/vol11\/p800-jindal.pdf","label_id":"243132","label":0}],"msr_related_uploader":"","msr_attachments":[],"msr-author-ordering":[{"type":"user_nicename","value":"Alekh Jindal","user_id":37419,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Alekh Jindal"},{"type":"user_nicename","value":"Konstantinos Karanasos","user_id":32565,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Konstantinos Karanasos"},{"type":"user_nicename","value":"Sriram Rao","user_id":33712,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Sriram Rao"},{"type":"text","value":"Hiren Patel","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[496901],"msr_group":[684024],"msr_project":[723529,474279,595033],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":723529,"post_title":"Peregrine","post_name":"peregrine","post_type":"msr-project","post_date":"2021-02-05 16:07:39","post_modified":"2021-02-05 18:32:41","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/peregrine\/","post_excerpt":"Database administrators (DBAs) were traditionally responsible for optimizing the on-premise database workloads. However, with the rise of cloud data services where cloud providers offer fully managed data processing capabilities, the role of a DBA is completely missing. At the same time, workload optimization becomes even more important for reducing the total costs of operation and making data processing economically viable in the cloud. This project revisits workload optimization in the context of these emerging cloud-based…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/723529"}]}},{"ID":474279,"post_title":"CloudViews: Materialized Views for Big Data Workloads","post_name":"cloudviews-materialized-views-big-data-workloads","post_type":"msr-project","post_date":"2018-03-16 15:02:38","post_modified":"2018-03-16 15:21:41","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/cloudviews-materialized-views-big-data-workloads\/","post_excerpt":"Analytics workloads in Microsoft's clusters are generated by users submitting SCOPE scripts, which are compiled down to jobs, which are then executed on the clusters. \u00a0An analysis of these workloads suggests\u00a0that there is significant overlap between the sub-graphs of the different jobs. \u00a0That is, the sub-graphs of different jobs compute the same result. \u00a0 This suggests that we can build upon view materialization to improve performance. \u00a0However, doing view materialization at a massive scale, where…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/474279"}]}},{"ID":595033,"post_title":"CloudViews","post_name":"cloudviews","post_type":"msr-project","post_date":"2019-06-21 20:17:53","post_modified":"2021-02-05 18:33:29","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/cloudviews\/","post_excerpt":"Analytics-as-a-service, or analytics job service, is emerging as a new paradigm for enterprise data analytics. These services are motivated by the fact that setting up and running data analytics is a major hurdle for enterprises. Although platform as a service (PaaS), software as a service (SaaS), and more recently database as a service (DBaaS) have eased the pain of provisioning and scaling hardware and software infrastructures, users are still responsible for managing and tuning their…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/595033"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/723682"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/723682\/revisions"}],"predecessor-version":[{"id":723685,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/723682\/revisions\/723685"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=723682"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=723682"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=723682"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=723682"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=723682"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=723682"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=723682"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=723682"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=723682"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=723682"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=723682"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=723682"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=723682"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=723682"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=723682"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=723682"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}