Index tuning is crucial for database performance. State-of-the-art index tuners rely on the query optimizer's cost estimates to search for the index configuration with the largest estimated improvement in execution cost. Due to well-known limitations in the optimizer's estimates, in a significant fraction of cases an index estimated to improve a query's execution cost, e.g., CPU time, makes it worse when implemented, i.e., the query regresses. Such errors are a major impediment to automated indexing in production systems.
In this project, we propose techniques to improve index tuning without causing query regressions, including:
Harnessing the best of many plans using Plan Stitch:
Query performance regression caused by the query optimizer selecting a bad execution plan is a major pain point in production workloads. Commercial DBMSs today can automatically detect and correct such plan regressions by storing previously-executed plans and reverting to a previous plan that is still valid and has the least execution cost. Such reversion-based plan correction carries relatively low risk of regression since the decision is based on observed execution costs. However, this approach ignores potentially valuable information in efficient subplans of other previously-executed plans. In this work, we propose a novel technique, Plan Stitch, that automatically and opportunistically combines efficient subplans of previously-executed plans into a valid new plan, which can be cheaper than any individual previously-executed plan. We implement Plan Stitch on top of Microsoft SQL Server. Our experiments on the TPC-DS benchmark and three real-world customer workloads show that plans obtained via Plan Stitch can reduce execution cost significantly, by up to two orders of magnitude compared with reverting to the cheapest previously-executed plan.
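To make the stitching idea concrete, here is a minimal Python sketch under simplifying assumptions: each previously-executed plan is a tree of operators annotated with observed costs, a subplan's logical expression is approximated by the set of base tables it covers, and any mix of subplans computing the same expression is assumed valid. The actual Plan Stitch in SQL Server additionally enforces plan validity (e.g., required physical properties) and uses richer costing; all operator names and numbers below are illustrative.

```python
# Illustrative sketch of stitching cheaper subplans from previously-executed plans.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class SubPlan:
    op: str                              # physical operator, e.g. "HashJoin"
    tables: frozenset                    # logical expression = base tables covered
    cost: float                          # observed execution cost of this subtree
    children: Tuple["SubPlan", ...] = ()

def collect(plan, by_expr):
    """Index every observed subplan by the logical expression it computes."""
    by_expr.setdefault(plan.tables, []).append(plan)
    for child in plan.children:
        collect(child, by_expr)

def stitch(expr, by_expr, memo):
    """Return (cost, plan) of the cheapest stitched plan for a logical expression."""
    if expr in memo:
        return memo[expr]
    best_cost, best_plan = float("inf"), None
    for sub in by_expr.get(expr, []):
        # Option 1: reuse this observed subplan as-is.
        if sub.cost < best_cost:
            best_cost, best_plan = sub.cost, sub
        # Option 2: keep its root operator but stitch in cheaper children.
        if sub.children:
            local = sub.cost - sum(c.cost for c in sub.children)  # root operator's own cost
            kids = [stitch(c.tables, by_expr, memo) for c in sub.children]
            total = local + sum(k[0] for k in kids)
            if total < best_cost:
                best_cost = total
                best_plan = SubPlan(sub.op, expr, total, tuple(k[1] for k in kids))
    memo[expr] = (best_cost, best_plan)
    return memo[expr]

# Two previously-executed plans for the same query over tables A, B, C.
p1 = SubPlan("HashJoin", frozenset("ABC"), 120.0, (
        SubPlan("HashJoin", frozenset("AB"), 90.0, (
            SubPlan("Scan", frozenset("A"), 40.0),
            SubPlan("Scan", frozenset("B"), 30.0))),
        SubPlan("Scan", frozenset("C"), 10.0)))
p2 = SubPlan("MergeJoin", frozenset("ABC"), 150.0, (
        SubPlan("NestedLoops", frozenset("AB"), 60.0, (
            SubPlan("IndexSeek", frozenset("A"), 5.0),
            SubPlan("Scan", frozenset("B"), 30.0))),
        SubPlan("Scan", frozenset("C"), 35.0)))

by_expr = {}
for p in (p1, p2):
    collect(p, by_expr)
cost, plan = stitch(frozenset("ABC"), by_expr, {})
print(cost)  # 85.0: cheaper than either executed plan (120.0 and 150.0)
```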
Leveraging query execution statistics to improve index recommendation quality:
We observe that comparing the execution costs of two plans of the same query under different index configurations is a key step during index tuning. Instead of using the optimizer's estimates for this comparison, our key insight is that formulating it as a classification task in machine learning yields significantly higher accuracy. We present a study of the design space for this classification problem. We further show how to integrate this classifier into state-of-the-art index tuners with minimal modifications. Our evaluation using industry-standard benchmarks and a large number of real customer workloads demonstrates up to a 5× reduction in the errors in identifying the cheaper plan of a pair, which eliminates almost all query execution cost regressions when the model is used in index tuning.
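The following sketch illustrates the formulation only: it casts "which of two plans is cheaper?" as a binary classification problem over features of both plans, using scikit-learn on synthetic data. The features, labels, and learner here are hypothetical stand-ins, not the feature set or model used in the project.

```python
# Sketch: plan-pair comparison as binary classification (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def plan_features(n):
    """Hypothetical per-plan features: estimated cost, cardinality, operator count, I/O."""
    return np.column_stack([
        rng.lognormal(3, 1, n),    # optimizer's estimated cost
        rng.lognormal(8, 2, n),    # estimated cardinality
        rng.integers(2, 30, n),    # number of operators
        rng.lognormal(10, 2, n),   # estimated bytes read
    ])

n = 5000
f1, f2 = plan_features(n), plan_features(n)
# Pair-wise feature vector: both plans' features plus their differences.
X = np.hstack([f1, f2, f1 - f2])
# Synthetic "observed" execution costs standing in for real execution statistics.
true1 = f1[:, 0] * rng.lognormal(0, 0.3, n)
true2 = f2[:, 0] * rng.lognormal(0, 0.3, n)
y = (true2 < true1).astype(int)   # label 1 => the second plan is actually cheaper

idx = np.arange(n)
tr, te = train_test_split(idx, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[tr], y[tr])

# Baseline: trust the optimizer's estimated costs directly.
baseline = (f2[te, 0] < f1[te, 0]).astype(int)
print("classifier accuracy:        ", clf.score(X[te], y[te]))
print("optimizer-estimate accuracy:", (baseline == y[te]).mean())
```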
Actively acquiring execution data to improve the ML model's performance:
While the machine learning model works well when sufficient execution statistics are available, we observe that its performance degrades significantly when the test data diverges from the data used to train it. We address this degradation by using B-instances to collect additional data during deployment. We propose an active data collection platform, ADCP, that employs active learning (AL) to gather relevant data cost-effectively. We develop a novel AL technique, Holistic Active Learner (HAL), that robustly combines multiple noisy signals for data gathering in the context of database applications. HAL applies to various ML tasks, budget sizes, cost types, and budgeting interfaces. We evaluate ADCP on both industry-standard benchmarks and real customer workloads. Our evaluation shows that, compared with other baselines, our technique improves ML models' prediction performance by up to 2× with the same cost budget. In particular, on production workloads, our technique reduces the prediction error of ML models by 75% using about 100 additionally collected queries.
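The sketch below illustrates, under assumptions of our own, how several noisy acquisition signals can be rank-normalized, combined, and used to select queries to execute under a cost budget. The signals, the aggregation rule, and the greedy budgeting shown are simplified placeholders, not the exact HAL algorithm or ADCP interfaces.

```python
# Sketch: budgeted data collection that combines multiple noisy acquisition signals.
import numpy as np

rng = np.random.default_rng(1)
n_candidates = 1000

# Hypothetical noisy signals scoring how useful executing each candidate query would be.
uncertainty  = rng.random(n_candidates)           # e.g., model predictive uncertainty
disagreement = rng.random(n_candidates)           # e.g., variance across an ensemble
frequency    = rng.random(n_candidates)           # e.g., how often the query recurs
exec_cost    = rng.lognormal(0, 1, n_candidates)  # cost of executing the query to label it

def rank_normalize(x):
    """Map raw scores to [0, 1] ranks so signals on different scales are comparable."""
    return np.argsort(np.argsort(x)) / (len(x) - 1)

# Robust combination: average rank-normalized signals rather than raw values,
# so one badly-scaled or badly-calibrated signal cannot dominate the others.
combined = (rank_normalize(uncertainty)
            + rank_normalize(disagreement)
            + rank_normalize(frequency)) / 3.0

# Greedy selection under a total execution-cost budget: take queries in order of
# benefit per unit cost until the budget is exhausted.
budget = 50.0
order = np.argsort(-combined / exec_cost)
selected, spent = [], 0.0
for q in order:
    if spent + exec_cost[q] <= budget:
        selected.append(q)
        spent += exec_cost[q]
print(f"selected {len(selected)} queries, total cost {spent:.1f}")
```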
Overview Talk at AIDB@VLDB 2020:
Slide deck: link