Multi-Scale Matrix Sampling and Sublinear-Time PageRank Computation
- Christian Borgs
- Michael Brautbar
- Jennifer Chayes
- Shang-Hua Teng
Internet Mathematics 10
A fundamental problem arising in many applications in Web science and social network analysis is that of identifying all nodes in a network whose PageRank exceeds a given threshold Δ. In this paper, we study the probabilistic version of the problem: given an arbitrary approximation factor c > 1, we are asked to output a set S of nodes such that, with high probability, S contains all nodes of PageRank at least Δ and no node of PageRank smaller than Δ/c. We call this problem SignificantPageRanks. We develop a nearly optimal, local algorithm for the problem with runtime complexity Õ(n/Δ) on networks with n nodes, where the tilde hides a polylogarithmic factor. We show that any algorithm for solving this problem must have runtime Ω(n/Δ), rendering our algorithm optimal up to logarithmic factors. Our algorithm has sublinear time complexity for applications, including Web crawling and Web search, that require efficient identification of nodes whose PageRanks are above a threshold Δ = n^δ, for some constant 0 < δ < 1. Our algorithm comes with two main technical contributions. The first is a multi-scale sampling scheme for a basic matrix problem that could be of interest on its own. For us, it appears as an abstraction of a subproblem we need to tackle in order to solve the SignificantPageRanks problem, but we hope that this abstraction will be useful in designing fast algorithms for identifying nodes that are significant beyond PageRank measurements. In the abstract matrix problem, it is assumed that one can access an unknown right-stochastic matrix by querying its rows, where the cost of a query and the accuracy of the answers depend on a precision parameter ε. At a cost proportional to 1/ε, the query returns a list of O(1/ε) entries and their indices that provide an ε-precision approximation of the row. Our task is to find a set that contains all columns whose sum is at least Δ, and omits any column whose sum is less than Δ/c.
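To make the abstract matrix problem concrete, the following Python sketch sets up a small right-stochastic matrix, a hypothetical row oracle, and a check of the output condition. It is only an illustration under stated assumptions, not the multi-scale sampling scheme of the paper: the function names are invented here, "ε-precision" is interpreted as dropping every entry smaller than ε, and the column sums are computed by reading the whole matrix, which is exactly the brute-force work a sampling scheme is meant to avoid.

```python
def query_row(row, eps):
    """Hypothetical row oracle (an assumption, not the paper's definition).

    Returns (index, value) pairs for entries >= eps. Because the row is
    nonnegative and sums to 1, at most 1/eps entries can survive, and each
    dropped entry is smaller than eps.
    """
    return [(j, v) for j, v in enumerate(row) if v >= eps]


def column_sums(matrix):
    """Exact column sums, computed by brute force over every entry."""
    sums = [0.0] * len(matrix[0])
    for row in matrix:
        for j, v in enumerate(row):
            sums[j] += v
    return sums


def is_valid_answer(candidate, sums, delta, c):
    """Check the output condition: every column with sum >= delta is in the
    candidate set, and no column with sum < delta / c is in it."""
    must_include = {j for j, s in enumerate(sums) if s >= delta}
    must_exclude = {j for j, s in enumerate(sums) if s < delta / c}
    return must_include <= candidate and not (candidate & must_exclude)


# Toy right-stochastic matrix: every row sums to 1.
M = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.5, 0.25, 0.25, 0.0],
    [0.4, 0.1, 0.0, 0.5],
]
DELTA, C = 1.5, 2.0

if __name__ == "__main__":
    sums = column_sums(M)
    print("truncated first row:", query_row(M[0], 0.15))
    print("column sums:", sums)
    print("{0} valid:", is_valid_answer({0}, sums, DELTA, C))
    print("{0, 2} valid:", is_valid_answer({0, 2}, sums, DELTA, C))
```

In this toy instance, column 1 has a sum between Δ/c and Δ, so a valid answer may either include or omit it; the approximation factor c is what creates this slack.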
Our multi-scale sampling scheme solves this problem with cost Õ(n/Δ), while traditional sampling algorithms would take time Ω((n/Δ)^2). Our second main technical contribution is a new local algorithm for approximating personalized PageRank, which is more robust than the earlier ones developed in [2, 11] and is highly efficient, particularly for networks with large in-degrees or out-degrees. Together with our multi-scale sampling scheme, it allows us to optimally solve the SignificantPageRanks problem.
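As a point of reference for the problem statement, the Python sketch below computes PageRank exactly on a toy graph by power iteration and checks whether a candidate set S meets the SignificantPageRanks guarantee. This is a linear-time baseline, not the sublinear local algorithm described above. Several details are assumptions made for illustration: scores are scaled to sum to n (consistent with thresholds of the form Δ = n^δ), the teleport probability is 0.15, dangling nodes spread their mass uniformly, and all function names are hypothetical.

```python
def pagerank(out_links, alpha=0.15, iters=100):
    """Power iteration for PageRank, scaled so the scores sum to n.

    out_links[u] is the list of nodes that u links to. alpha is the
    teleport probability (an assumed value, not taken from the paper).
    """
    n = len(out_links)
    pr = [1.0] * n
    for _ in range(iters):
        nxt = [alpha] * n  # teleport contribution under the sum-to-n scaling
        for u, targets in enumerate(out_links):
            # Dangling nodes distribute their mass uniformly (assumption).
            receivers = targets if targets else range(n)
            share = (1 - alpha) * pr[u] / len(receivers)
            for v in receivers:
                nxt[v] += share
        pr = nxt
    return pr


def satisfies_guarantee(S, pr, delta, c):
    """True if S contains every node with PageRank >= delta and contains
    no node with PageRank < delta / c."""
    has_all_high = all(v in S for v, p in enumerate(pr) if p >= delta)
    has_no_low = all(pr[v] >= delta / c for v in S)
    return has_all_high and has_no_low


# Toy graph: nodes 1..5 all link to node 0, and node 0 links to node 1.
GRAPH = [[1], [0], [0], [0], [0], [0]]
DELTA, C = 2.7, 2.0

if __name__ == "__main__":
    pr = pagerank(GRAPH)
    print("PageRank:", [round(p, 3) for p in pr])
    print("{0} ok:", satisfies_guarantee({0}, pr, DELTA, C))
    print("{0, 2} ok:", satisfies_guarantee({0, 2}, pr, DELTA, C))
```

Here node 0 lies above Δ and node 1 falls between Δ/c and Δ, so both {0} and {0, 1} are acceptable outputs, while any set containing one of the low-PageRank nodes 2 through 5 is not.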