{"id":239195,"date":"2020-02-28T15:32:39","date_gmt":"2016-06-16T19:54:24","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=239195"},"modified":"2020-03-13T16:56:25","modified_gmt":"2020-03-13T23:56:25","slug":"quickr-cost-effective-data-analytics-scale","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/quickr-cost-effective-data-analytics-scale\/","title":{"rendered":"Quickr: Cost-Effective Data Analytics at Scale"},"content":{"rendered":"

We are inundated with data. Resources to analyze the data are finite and expensive. Approximate answers allow us to explore much larger amounts of data than otherwise possible given available resources. Reducing the cost, if doable for a large fraction of the complex queries that run on this data, is of strategic importance because the savings can be re-invested into more sophisticated algorithms or be used as a key differentiator for analytics-as-a-service offerings.<\/p>\n

Unfortunately, state-of-art techniques cannot approximate complex queries. Most production bigdata systems offer the uniform sample operator. The user can sample as desired. But the systems do not reason about how the answer will change. A rich vein of prior research builds samples over input datasets. They deliver benefit to predictable queries that touch only one large dataset. Joins with small dimension tables are okay. However, they cannot help queries that join more than one large table, queries that touch less frequently used datasets or query sets that use a diverse set of columns. Such queries and datasets dominate in bigdata clusters. On the TPC-DS benchmark, our experiments show that when given 1x (4x) the size of the input to store samples, a state-of-the-art apriori sampling technique, BlinkDB, offers benefit for 11% (17%) of the queries.<\/p>\n

Despite statistical sampling being well-understood, there have been no real breakthroughs in offering approximate answers for complex ad-hoc queries. We break new ground in research on this topic:\u00a0 (1) we have discovered new samplers that effectively sample join inputs; (2) by treating samplers as native operators in a query optimizer, we show that one can cover a substantial fraction of complex queries without\u00a0pre-constructed samples; and (3) we extend the state-of-art in reasoning about the accuracy of sampled plans.<\/p>\n

Zero-touch approximation has been a Holy Grail problem. Achieving it for a large subset of U-SQL (SQL + user-defined operators) is a narrow waist for disruption since any query expressible in U-SQL can then be automatically approximated.<\/p>\n

<\/h2>\n
\"State-of-the-art:

State-of-the-art: Apriori or Input Sampling.<\/p><\/div>\n

 <\/p>\n

\"Quickr:

Quickr: Inline samplers injected by the query optimizer.<\/p><\/div>\n

 <\/p>\n

 <\/p>\n

 <\/p>\n

 <\/p>\n","protected":false},"excerpt":{"rendered":"

Quickr explores cost-effective data analytics at scale, breaking new ground in research on offering approximate answers for complex ad-hoc queries.<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"footnotes":""},"research-area":[13561,13556,13552,13554,13560,13555,13558,13547],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-239195","msr-project","type-msr-project","status-publish","hentry","msr-research-area-algorithms","msr-research-area-artificial-intelligence","msr-research-area-hardware-devices","msr-research-area-human-computer-interaction","msr-research-area-programming-languages-software-engineering","msr-research-area-search-information-retrieval","msr-research-area-security-privacy-cryptography","msr-research-area-systems-and-networking","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"2016-03-08","related-publications":[242453,354086,370358,386069,451668,480003,580546,619941,686094,688683,238027],"related-downloads":[599259],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[{"id":0,"name":"Related work","content":"There is a large class of prior work. Below, we list a few that we\u00a0built upon\u00a0as we did this work. We are aware that this is a partial list and apologize in advance for omissions.\r\n

    \r\n \t
  • \r\n
    Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches by Cormode, Garofalakis, Haas and Jermaine.<\/div><\/li>\r\n \t
  • \u00a0Optimized stratified sampling for approximate query processing by Chaudhuri, Das and Narasayya.<\/li>\r\n \t
  • BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data by Agarwal, Panda, Mozafari, Madden and Stoica.<\/li>\r\n \t
  • A Sampling Algebra for Aggregate Estimation by Nirkhiwale, Dobra and Jermaine.<\/li>\r\n<\/ul>"},{"id":1,"name":"Details","content":"

    Goals of Quickr<\/h2>\r\nA desirable system for approximating big data queries has three aspects. First, it should offer turnkey support for approximations. That is, given a query, decide whether or not it can be sampled and output the appropriate sampled query plan. Input samples or knowledge of future queries may not be available. Second, the system should support complex queries, i.e., support the large portion of U-SQL. Finally, offer strong guarantees on the accuracy of the answer; for example, with high probability (whp), none of the groups in the answer will be missed and the output value of aggregations is within a bounded ratio of the true answer.\r\n

    Inline sampling: gains without overhead<\/h2>\r\nQuickr samples inline in the query plan\u00a0as shown in the figure on the right. In contrast\u00a0to\u00a0prior work that maintains input samples and matches queries to preexisting samples, it is easy to see that inline sampling\u00a0has little apriori overhead. A key observation in Quickr is that big-data queries perform multiple passes over data in part due to the complexity of the queries and in part due to the nature of parallel plans; the median query in Cosmos has 2.4 effective\u00a0passes over data (90th percentile value is 6.5). Hence, inline sampling also has the potential for sizable gains.\r\n

    Cover more queries<\/h2>\r\nMany queries that appear un-approximable for input samples can be sped up by inline sampling. Consider a query that touches many columns;\u00a0stratifying\u00a0on\u00a0many columns leads to an input sample\u00a0that is\u00a0as large as the input and hence such queries would not benefit from input\u00a0samples.\u00a0However, Quickr can place a sampler after the\u00a0selections or joins whose (complex) predicates contributed many of these columns. If the selects are pushed down to the first parallel pass on data (as they often are), samplers will reduce data written by the first pass speeding up all downstream operations.\r\n

    Sampling multiple inputs of a join<\/h2>\r\nAnother novel aspect that lets Quickr cover many more queries than prior methods is a universe sampler that samples multiple join inputs. It is well known that joining a p probability sample of inputs is akin to a p^2 probability sample of the join output. Hence, sampling the join inputs improves performance at the cost of substantial degradation in answer quality. On the other hand, when queries join large inputs, sampling the join output offers limited speed-up.\r\n\r\nWith the universe sampler, joining a p probability sample of inputs is statistically equivalent to a p probability universe sample of the join output. The key idea is to consistently sample the join inputs without any coordination (such as exchanging histograms of join keys). We will show that the universe sampler is applicable broadly, i.e., it supports multiple equijoins and only requires the group-by columns and the value of aggregates to be uncorrelated with the sampled column set.\r\n

    Sampling + QO<\/h2>\r\nQuickr offers turn-key support for approximations by picking for every newly arriving query an execution plan with samplers. There are at least two choices as to how we can obtain a good plan that contains samplers: (a) Insert samplers a posteriori into a plan that is output by a traditional relational query optimizer or (b) Incorporate samplers as first-class operators along with the other database operators and explore the larger combined space of possible plans within a query optimizer. Notice that option (b) can yield plans that cannot be obtained from using option (a). For example, when a sampler reduces cardinality, downstream join operations can be implemented differently and more efficiently as a cross join instead of a hash-join. As another example, for queries with many joins and selects, option (a) may offer a plan on which all simple edits to insert samplers appear infeasible (inaccurate). Yet, a different ordering of the joins or selects may allow samplers to be inserted. Hence, we chose option (b); we offer a new algorithm that incorporates samplers as native operators into a Cascades-style cost-based query optimizer.\r\n

    Reasoning about the accuracy of sampled expressions<\/h2>\r\nOur method transforms a query expression with arbitrarily many samplers to an expression with one sampler at the root. In particular, we generalize prior work that only considered SUM-like aggregates to the case where answers can have groups and multiple aggregations of various kinds. We also generalize the method to a broader class of samplers that are not uniformly random.\r\n\r\nFurthermore, we compute\u00a0unbiased estimators of the aggregations and various error measures (such as variance, confidence interval etc.)\u00a0 in one effective pass over data whereas in general error bounds require a self-join over samples or bootstrap which involves hundreds or thousands of trials on the sample.\r\n

    Sampling dominance<\/h3>\r\nWe introduce a novel notion of dominance:\u00a0given two query expressions E1 and E2 whose output is identical when samplers are removed, E1 is said to dominate E2, iff the accuracy of E1 is no worse than that of E2. We use three ideas to establish dominance for a rich class of samplers. First, any uniformly random sampler convolves with all database operations in the sense that there exists a corresponding sampled expression which has exactly the same accuracy (variance and expectation). Second, any sampler that is strictly more likely to pass a row relative to some uniformly random sampler is analyzable in that its accuracy can be bounded from below. Finally, we show a special-case set of samplers that are not uniformly random but do convolve with all database operators. The first idea is borrowed from prior work but the other two are new to the best of our knowledge.\r\n\r\n \r\n\r\n\"An\r\n\r\nAn example script that is otherwise unapproximable. Input sampling would choose to stratify each of the fact tables (store sales etc.) on {item_sk, date_sk, customer_sk} so as to not miss groups in the answer but the triple has almost as many distinct items as the original dataset. Quickr would place our new universe sampler on all three fact tables.\r\n\r\n \r\n\r\n \r\n\r\n\"Quickr\r\n\r\n \r\n\r\n\"Pushing\r\n\r\nPushing samplers past a join operation.\r\n\r\n\"A\r\n\r\nA transformation rule that generates alternative plans with the sampler moved to before a selection."},{"id":2,"name":"Status","content":"

    Prototype and status<\/h2>\r\nWe have built a prototype in Cosmos as an initial step towards these goals. On queries from the TPC-DS benchmark, Quickr reduces the median resource usage by 2x; no groups are missed and 80% of the aggregates are within +-10% of their true value."}],"slides":[],"related-researchers":[{"type":"user_nicename","display_name":"Surajit Chaudhuri","user_id":33764,"people_section":"Section name 0","alias":"surajitc"},{"type":"user_nicename","display_name":"Srikanth Kandula","user_id":33707,"people_section":"Section name 0","alias":"srikanth"},{"type":"user_nicename","display_name":"Kukjin Lee","user_id":32593,"people_section":"Section name 0","alias":"kulee"}],"msr_research_lab":[199565],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/239195"}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":9,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/239195\/revisions"}],"predecessor-version":[{"id":643098,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/239195\/revisions\/643098"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=239195"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=239195"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=239195"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=239195"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=239195"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}