{"id":255978,"date":"2016-07-14T11:56:45","date_gmt":"2016-07-14T18:56:45","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&p=255978"},"modified":"2018-12-17T18:30:53","modified_gmt":"2018-12-18T02:30:53","slug":"sample-seek-approximating-aggregates-distribution-precision-guarantee-2","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/sample-seek-approximating-aggregates-distribution-precision-guarantee-2\/","title":{"rendered":"Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee"},"content":{"rendered":"

Data volumes are growing exponentially for our decision-support systems making it challenging to ensure interactive response time for ad-hoc queries without increasing cost of hardware. Aggregation queries with Group By <\/em>that produce an aggregate value for every combination of values in the grouping columns are the most important class of ad-hoc queries. As small errors are usually tolerable for such queries, approximate query processing (AQP) has the potential to answer them over very large datasets much faster. In many cases analysts require the distribution of (group, aggvalue) pairs in the estimated answer to be guaranteed within a certain error threshold of the exact distribution. Existing AQP techniques are inadequate for two main reasons. First, users cannot express such guarantees. Second, sampling techniques used in traditional AQP can produce arbitrarily large errors even for SUM queries. To address those limitations, we first introduce a new precision metric, called distribution precision<\/em>, to express such error guarantees. We then study how to provide fast approximate answers to aggregation queries with distribution precision guaranteed within a user-specified error bound. The main challenges are to provide rigorous error guarantees and to handle arbitrary highly selective predicates without maintaining large-sized samples. We propose a novel sampling scheme called measure-biased sampling<\/em> to address the former challenge. For the latter, we propose two new indexes to augment in-memory samples. Like other sampling-based AQP techniques, our solution supports any aggregate that can be estimated from random samples. In addition to deriving theoretical guarantees, we conduct experimental study to compare our system with state-of-the-art AQP techniques and a commercial column-store database system on both synthetic and real enterprise datasets. Our system provides a median speed-up of more than 100x with around 5% distribution error compared with the commercial database.<\/p>\n","protected":false},"excerpt":{"rendered":"

Data volumes are growing exponentially for our decision-support systems making it challenging to ensure interactive response time for ad-hoc queries without increasing cost of hardware. Aggregation queries with Group By that produce an aggregate value for every combination of values in the grouping columns are the most important class of ad-hoc queries. As small errors […]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"msr-content-type":[3],"msr-research-highlight":[],"research-area":[13561,13563],"msr-publication-type":[193716],"msr-product-type":[],"msr-focus-area":[],"msr-platform":[],"msr-download-source":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-255978","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-algorithms","msr-research-area-data-platform-analytics","msr-locale-en_us"],"msr_publishername":"ACM - Association for Computing Machinery","msr_edition":"","msr_affiliation":"","msr_published_date":"2016-7-14","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"255975","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/07\/sigmod16sampleseek.pdf","id":"255975","title":"sigmod16sampleseek","label_id":"243103","label":0},{"type":"doi","viewUrl":"false","id":"false","title":"10.1145\/2882903.2915249","label_id":"243109","label":0}],"msr_related_uploader":"","msr_attachments":[],"msr-author-ordering":[{"type":"user_nicename","value":"Bolin Ding","user_id":31274,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Bolin Ding"},{"type":"text","value":"Silu Huang","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Surajit Chaudhuri","user_id":33764,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Surajit Chaudhuri"},{"type":"user_nicename","value":"Kaushik Chakrabarti","user_id":32503,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Kaushik Chakrabarti"},{"type":"user_nicename","value":"Chi Wang","user_id":31406,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Chi Wang"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[],"msr_project":[700999],"publication":[],"video":[],"download":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":700999,"post_title":"Sublinear Approximation for Large-scale Data Science","post_name":"sublinear-approximation-for-large-scale-data-science","post_type":"msr-project","post_date":"2020-10-24 11:50:50","post_modified":"2020-10-24 12:11:22","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/sublinear-approximation-for-large-scale-data-science\/","post_excerpt":"One challenge in large scale data science is that even linear algorithms can result in large data processing cost and long latency, which limit the interactivity of the system and the productivity of data scientists. This project has an ambitious goal of enabling data science with sublinear complexity, such that the cost grows slowly or independently with the data size. We explore approximation methods for time-consuming tasks in data analytics and machine learning. Our methods…","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/700999"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/255978","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/255978\/revisions"}],"predecessor-version":[{"id":527015,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/255978\/revisions\/527015"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=255978"}],"wp:term":[{"taxonomy":"msr-content-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-content-type?post=255978"},{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=255978"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=255978"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=255978"},{"taxonomy":"msr-product-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-product-type?post=255978"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=255978"},{"taxonomy":"msr-platform","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-platform?post=255978"},{"taxonomy":"msr-download-source","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-download-source?post=255978"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=255978"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=255978"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=255978"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=255978"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=255978"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=255978"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=255978"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}