One challenge in large-scale data science is that even linear-time algorithms can incur high data-processing costs and long latency, which limits both the interactivity of the system and the productivity of data scientists. This project pursues the ambitious goal of enabling data science with sublinear complexity, so that the cost grows slowly with, or is independent of, the data size. We explore approximation methods for time-consuming tasks in data analytics and machine learning. Our methods are based on data sampling and are easy to implement. The results come with probabilistic error bounds in theory and hold up in practice: compared to exact solutions, the approximate solutions achieve 10-1000x speedups within a small error bound on typical workloads, and the performance gain amplifies as the data size grows.
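As a minimal illustration of this premise (not one of the project's actual algorithms), the sketch below estimates a column mean from a fixed-size uniform sample and attaches a Hoeffding-style probabilistic error bound; the sample size, and hence the cost, does not grow with the data size. Function names and parameters are illustrative.

```python
import math
import random

def approx_mean(data, sample_size=20_000, delta=0.05, value_range=1.0):
    """Estimate the mean of `data` from a uniform sample.

    Returns (estimate, epsilon) such that the true mean lies within
    estimate +/- epsilon with probability at least 1 - delta, by
    Hoeffding's inequality (values assumed to span at most `value_range`).
    """
    n = min(sample_size, len(data))
    sample = random.sample(data, n)
    estimate = sum(sample) / n
    # Hoeffding: epsilon = range * sqrt(ln(2 / delta) / (2 * n))
    epsilon = value_range * math.sqrt(math.log(2 / delta) / (2 * n))
    return estimate, epsilon

# The sample size (and hence the cost) stays the same as the data grows.
data = [random.random() for _ in range(5_000_000)]
est, eps = approx_mean(data)
print(f"mean ~= {est:.4f} +/- {eps:.4f} with 95% confidence")
```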
Examples of tasks enabled or supported by our theory:
- AutoML. We develop an algorithm-independent method to compute a confidence interval for a model's accuracy using only a subsample of the training data. It differs from the generalization bounds typically studied in learning theory, and in practice it yields orders-of-magnitude speedups in model selection (a simplified sketch follows this list). We also develop a fast entropy approximation method for feature selection.
- Network embedding (for billion-edge graphs, based on random-walk matrix factorization). We prove a new matrix concentration bound that gives the first sample-complexity result for graph representation learning, and we develop a graph coarsening method with provable guarantees (the sampling-and-factorization recipe is sketched after this list).
- Attribute recommendation (for data exploration). We prove a probabilistic error bound for sample-based approximation of the L1 and L2 distances, and use it in a near-optimal algorithm for selecting the top-K attributes that best distinguish two ad-hoc subsets of the data (see the sketch below).
- Outlier detection (in multi-armed bandits). Outlier arms differ from top-K arms in that the number of outliers is not known a priori. We define outlier arms via a k-sigma rule and prove that the sample complexity of two simple algorithms depends only on the distribution of the arms' mean rewards (an illustrative version of the rule is sketched below).
- Interactive visual analytics for big data. We propose approximate query processing techniques with optimal error guarantees for group-by queries in data warehouses (a simple sampled group-by estimator is sketched below).
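The sketches below are minimal, hedged illustrations of several items above; all function names and parameters are hypothetical, and each one simplifies the actual method. First, for AutoML: training a candidate model on a subsample and bounding its held-out accuracy with a standard Hoeffding interval, then selecting by the lower confidence bound. The project's algorithm-independent interval is not reproduced here; this only shows the general shape of subsample-based model selection.

```python
import math
import random

def subsample_accuracy_ci(X, y, model, train_size=5_000, test_size=5_000, delta=0.05):
    """Fit `model` (any object with fit/predict) on a small subsample and
    return (accuracy, epsilon): the model's true accuracy lies within
    accuracy +/- epsilon with probability >= 1 - delta, by a Hoeffding
    bound on the held-out sample."""
    idx = random.sample(range(len(X)), train_size + test_size)
    train, test = idx[:train_size], idx[train_size:]
    model.fit([X[i] for i in train], [y[i] for i in train])
    preds = model.predict([X[i] for i in test])
    acc = sum(int(p == y[i]) for p, i in zip(preds, test)) / test_size
    eps = math.sqrt(math.log(2 / delta) / (2 * test_size))
    return acc, eps

# Model selection: keep the candidate with the best lower confidence bound.
# scored = [(subsample_accuracy_ci(X, y, m), m) for m in candidates]
# best = max(scored, key=lambda s: s[0][0] - s[0][1])[1]
```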
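For network embedding, a toy version of the random-walk matrix-factorization recipe: sample walks, estimate the co-occurrence matrix, and factorize it with an SVD. The dense matrix and log scaling here are illustrative simplifications; billion-edge graphs require sparse data structures, and the project's concentration bound governs how many walks suffice.

```python
import numpy as np

def sampled_embedding(adj, dim=32, walks_per_node=10, walk_len=40, window=5, seed=0):
    """Toy random-walk matrix factorization: estimate node co-occurrences
    from sampled walks, then take a truncated SVD as the embedding.
    `adj[u]` is the list of neighbors of node u."""
    rng = np.random.default_rng(seed)
    n = len(adj)
    cooc = np.zeros((n, n))
    for start in range(n):
        for _ in range(walks_per_node):
            walk, node = [start], start
            for _ in range(walk_len - 1):
                if not adj[node]:
                    break
                node = adj[node][rng.integers(len(adj[node]))]
                walk.append(node)
            # Count co-occurrences within a sliding window over the walk.
            for i, u in enumerate(walk):
                for v in walk[i + 1:i + 1 + window]:
                    cooc[u, v] += 1
                    cooc[v, u] += 1
    # Log-scale the normalized co-occurrence matrix before factorizing.
    m = np.log1p(cooc / max(cooc.sum(), 1.0))
    u, s, _ = np.linalg.svd(m, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])
```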
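For attribute recommendation, a sketch of ranking categorical attributes by a sampled L1 distance between their value distributions in two ad-hoc subsets; the plug-in estimator and sample sizes are illustrative, and the near-optimal selection algorithm and its error bound are omitted.

```python
import random
from collections import Counter

def sampled_l1_distance(subset_a, subset_b, attr, sample_size=2_000):
    """Estimate the L1 distance between the distributions of `attr`
    in two subsets, using a uniform sample of each subset."""
    sample_a = random.sample(subset_a, min(sample_size, len(subset_a)))
    sample_b = random.sample(subset_b, min(sample_size, len(subset_b)))
    freq_a = Counter(row[attr] for row in sample_a)
    freq_b = Counter(row[attr] for row in sample_b)
    values = set(freq_a) | set(freq_b)
    return sum(abs(freq_a[v] / len(sample_a) - freq_b[v] / len(sample_b))
               for v in values)

def top_k_attributes(subset_a, subset_b, attrs, k=5):
    """Rank attributes by how much they distinguish the two subsets."""
    scores = {a: sampled_l1_distance(subset_a, subset_b, a) for a in attrs}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```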
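For outlier detection in bandits, a toy version of the k-sigma rule on estimated arm means; the fixed per-arm budget replaces the project's adaptive sampling, and the two analyzed algorithms and their sample-complexity bounds are not shown.

```python
import statistics

def outlier_arms(pull, n_arms, k=3, budget_per_arm=200):
    """Pull each arm `budget_per_arm` times, then flag arms whose estimated
    mean exceeds mean + k * sigma of all arm means (the k-sigma rule).
    `pull(i)` returns one stochastic reward for arm i."""
    means = [statistics.fmean(pull(i) for _ in range(budget_per_arm))
             for i in range(n_arms)]
    mu = statistics.fmean(means)
    sigma = statistics.pstdev(means)
    return [i for i, m in enumerate(means) if m > mu + k * sigma]

# Example: arm 7 has a clearly larger mean reward than the others.
# import random
# rewards = [0.3] * 10; rewards[7] = 0.9
# print(outlier_arms(lambda i: random.gauss(rewards[i], 0.1), n_arms=10))
```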
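Finally, for interactive visual analytics, a sketch of sample-based approximate query processing for a group-by average with a per-group confidence interval; a simple CLT-style interval stands in for the project's optimal error guarantees, and column names are placeholders.

```python
import math
import random
from collections import defaultdict

def approx_groupby_avg(rows, group_col, value_col, sample_rate=0.01, z=1.96):
    """Approximate SELECT group_col, AVG(value_col) ... GROUP BY group_col
    over a uniform row sample. Returns {group: (estimate, half_width)}
    where half_width is a ~95% CLT-style confidence interval."""
    groups = defaultdict(list)
    for row in rows:
        if random.random() < sample_rate:
            groups[row[group_col]].append(row[value_col])
    result = {}
    for g, vals in groups.items():
        n = len(vals)
        mean = sum(vals) / n
        var = sum((v - mean) ** 2 for v in vals) / max(n - 1, 1)
        result[g] = (mean, z * math.sqrt(var / n))
    return result
```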