{"id":700999,"date":"2020-10-24T11:50:50","date_gmt":"2020-10-24T18:50:50","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&p=700999"},"modified":"2020-10-24T12:11:22","modified_gmt":"2020-10-24T19:11:22","slug":"sublinear-approximation-for-large-scale-data-science","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/sublinear-approximation-for-large-scale-data-science\/","title":{"rendered":"Sublinear Approximation for Large-scale Data Science"},"content":{"rendered":"

One challenge in large scale data science is that even linear algorithms can result in large data processing cost and long latency, which limit the interactivity of the system and the productivity of data scientists. This project has an ambitious goal of enabling data science with sublinear complexity, such that the cost grows slowly or independently with the data size. We explore approximation methods for time-consuming tasks in data analytics and machine learning. Our methods are based on data sampling, which are very easy to implement. The results have probabilistic error bounds in theory and hold up to the premise in practice. Compared to the exact solutions, the approximate solutions have 10-1000 speedup within small error bound in typical workloads \u2013 and the performance gain amplifies as the data size grows.<\/p>\n

Examples of tasks enabled or supported by our theory:<\/p>\n