{"id":940866,"date":"2023-05-12T21:14:59","date_gmt":"2023-05-13T04:14:59","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/"},"modified":"2023-05-12T21:14:59","modified_gmt":"2023-05-13T04:14:59","slug":"predicate-pushdown-for-data-science-pipelines","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/predicate-pushdown-for-data-science-pipelines\/","title":{"rendered":"Predicate Pushdown for Data Science Pipelines"},"content":{"rendered":"
Predicate pushdown is a widely adopted query optimization. Existing systems and prior work mostly use
\npattern-matching rules to decide when a predicate can be pushed through certain operators like join or groupby.
\nHowever, challenges arise in optimizing for data science pipelines due to the widely used non-relational
\noperators and user-defined functions (UDF) that existing rules would fail to cover. In this paper, we present
\nMagicPush, which decides predicate pushdown using a search-verification approach. MagicPush searches for
\ncandidate predicates on pipeline input, which is often not the same as the predicate to be pushed down, and
\nverifies that the pushdown does not change pipeline output with full correctness guarantees. Our evaluation
\non TPC-H queries and 200 real-world pipelines sampled from GitHub Notebooks shows that MagicPush
\nsubstantially outperforms a strong baseline that uses a union of rules from prior work \u2013 it is able to discover
\nnew pushdown opportunities and better optimize 42 real-world pipelines with up to 99% reduction in running
\ntime, while discovering all pushdown opportunities found by the existing baseline on remaining cases.<\/p>\n","protected":false},"excerpt":{"rendered":"
Predicate pushdown is a widely adopted query optimization. Existing systems and prior work mostly use pattern-matching rules to decide when a predicate can be pushed through certain operators like join or groupby. However, challenges arise in optimizing for data science pipelines due to the widely used non-relational operators and user-defined functions (UDF) that existing rules […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[246574],"research-area":[13563],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[260221],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-940866","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-highlight-award","msr-research-area-data-platform-analytics","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2023-6","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"Best Paper Award","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/05\/predicate_pushdown_final.pdf","id":"940869","title":"predicate_pushdown_final","label_id":"243109","label":0}],"msr_related_uploader":"","msr_attachments":[{"id":940869,"url":"https:\/\/www.microsoft.com\/en-us\/research\/uploads\/prod\/2023\/05\/predicate_pushdown_final.pdf"}],"msr-author-ordering":[{"type":"user_nicename","value":"Cong Yan","user_id":39441,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Cong Yan"},{"type":"text","value":"Yin Lin","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Yeye He","user_id":34992,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Yeye He"}],"msr_impact_theme":[],"msr_research_lab":[199565],"msr_event":[],"msr_group":[957177],"msr_project":[967218],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":967218,"post_title":"Self-service Data Preparation","post_name":"self-service-data-preparation","post_type":"msr-project","post_date":"2023-11-08 14:36:00","post_modified":"2023-11-18 10:15:39","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/self-service-data-preparation\/","post_excerpt":"It is often cited that data scientists spend a significant portion of their time (up to 80%), cleaning and preparing data. For less-technical users, who may be less proficient in writing code (e.g., in Excel, Power-BI and Tableau), the tasks of preparing and cleaning data are not just time-consuming, but also technically challenging. In the \"Self-service Data Preparation\" project, our goal is to develop technologies that can automate common data-preparation tasks, in the context of…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/967218"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/940866"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/940866\/revisions"}],"predecessor-version":[{"id":940872,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/940866\/revisions\/940872"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=940866"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=940866"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=940866"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=940866"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=940866"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=940866"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=940866"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=940866"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=940866"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=940866"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=940866"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=940866"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=940866"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=940866"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=940866"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=940866"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}