Efficiently Approximating Selectivity Functions using Low Overhead Regression Models

46th International Conference on Very Large Data Bases (VLDB 2020) |

Today’s query optimizers use fast selectivity estimation techniques but are known
to be susceptible to large estimation errors. Recent work on supervised learned
models for selectivity estimation significantly improves accuracy while ensuring
relatively low estimation overhead. However, these models impose significant model
construction cost as they need large numbers of training examples and computing
selectivity labels is costly for large datasets. We propose a novel model construction
method that incrementally generates training data and uses approximate selectivity labels,
that reduces total construction cost by an order of magnitude while preserving most of
the accuracy gains. The proposed method is particularly attractive for model designs
that are faster-to-train for a given number of training examples, but such models are
known to support a limited class of query expressions. We broaden the applicability of such
supervised models to the class of select-project-join query expressions with range predicates
and IN clauses. Our extensive evaluation on synthetic benchmark and real-world queries shows
that the 95th-percentile error of our proposed models is 10-100× better than traditional selectivity
estimators. We also demonstrate significant gains in plan quality as a result of improved
selectivity estimates.