Predicate Pushdown for Data Science Pipelines

SIGMOD

Predicate pushdown is a widely adopted query optimization. Existing systems and prior work mostly use
pattern-matching rules to decide when a predicate can be pushed through certain operators like join or groupby.
However, data science pipelines are harder to optimize because they rely heavily on non-relational
operators and user-defined functions (UDFs) that existing rules fail to cover. In this paper, we present
MagicPush, which decides predicate pushdown using a search-verification approach. MagicPush searches for
candidate predicates on the pipeline input, which often differ from the predicate being pushed down, and
verifies with full correctness guarantees that the pushdown does not change the pipeline output. Our evaluation
on TPC-H queries and 200 real-world pipelines sampled from GitHub Notebooks shows that MagicPush
substantially outperforms a strong baseline that uses a union of rules from prior work: it discovers
new pushdown opportunities and better optimizes 42 real-world pipelines, with up to 99% reduction in running
time, while discovering all pushdown opportunities found by the baseline on the remaining cases.
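
The search-verification idea can be illustrated with a toy sketch in Python. It is not MagicPush's implementation: the pipeline, the candidate predicates, and every function name below are hypothetical, and the check compares outputs on a small sample, which gives no correctness guarantee of the kind the paper's verification step provides. It only shows why the predicate placed on the input can differ from the predicate on the output.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Table = List[Row]
Predicate = Callable[[Row], bool]


def pipeline(rows: Table) -> Table:
    # Hypothetical pipeline: a row-wise UDF that adds a derived column.
    return [{**r, "total": r["price"] * r["qty"]} for r in rows]


def output_pred(row: Row) -> bool:
    # Predicate written by the user on the pipeline OUTPUT.
    return row["total"] > 100


def select(rows: Table, pred: Predicate) -> Table:
    return [r for r in rows if pred(r)]


def pushdown_is_safe_on(sample: Table, candidate: Predicate) -> bool:
    # Empirical check only (an assumption of this sketch): filtering the
    # input with the candidate must leave the filtered output unchanged.
    baseline = select(pipeline(sample), output_pred)
    pushed = select(pipeline(select(sample, candidate)), output_pred)
    return baseline == pushed


# "Search": enumerate candidate predicates over the INPUT columns.
candidates: Dict[str, Predicate] = {
    "price > 100": lambda r: r["price"] > 100,
    "price * qty > 100": lambda r: r["price"] * r["qty"] > 100,
}

sample: Table = [
    {"price": 60.0, "qty": 2.0},
    {"price": 150.0, "qty": 1.0},
    {"price": 10.0, "qty": 3.0},
]

# "Verification": keep only candidates that preserve the output.
accepted = [name for name, c in candidates.items()
            if pushdown_is_safe_on(sample, c)]
print(accepted)
```

Here the candidate `price > 100` is rejected because it drops an input row whose derived `total` still passes the output filter, while `price * qty > 100` is accepted. The accepted predicate is expressed over input columns and is not textually the same as the output predicate.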