{"id":447117,"date":"2017-12-08T15:34:33","date_gmt":"2017-12-08T23:34:33","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?p=447117"},"modified":"2017-12-08T15:34:33","modified_gmt":"2017-12-08T23:34:33","slug":"collecting-telemetry-data-privately","status":"publish","type":"post","link":"https:\/\/www.microsoft.com\/en-us\/research\/blog\/collecting-telemetry-data-privately\/","title":{"rendered":"Collecting telemetry data privately"},"content":{"rendered":"

\"\"<\/p>\n

The collection and analysis of telemetry data from users and their devices leads to improved user experiences and informed business decisions. However, users have concerns about their data privacy, including what personal information software and internet companies are gathering and whether their data is protected from potential leaks and hacks.

Differential privacy (Dwork, et al. 2006) has emerged as the de facto standard for privacy guarantees. A special case (and the strongest variant) of differential privacy, called the local model, can be employed to address these concerns when collecting telemetry data. In the local model of differential privacy (LDP), each user i holds a private telemetry value x_i representing a characteristic such as time spent using a piece of software.

In the LDP model, instead of collecting x_i directly, the data collector first runs a randomized algorithm A to encode x_i on the user's device, and then collects the output Y_i = A(x_i). The LDP collection algorithm A needs to guarantee a type of plausible deniability: no matter what output A(x) is collected from a user, it is approximately as likely to have come from any specific value x as from any other value x'. Somewhat surprisingly, while observing values Y_i = A(x_i) coming from a large population of users, a data collector can tell almost nothing about any specific user, yet can still see general trends in the population, such as the mean, histogram, or other aggregates of users' values.

\"\"

Figure 1. Data collection and analytics under the LDP model<\/em><\/p><\/div>\n

One example of such an LDP collection algorithm in our NIPS 2017 paper is the following 1-bit algorithm:

\"\"<\/p>\n

Here a user with value x_i sends the data collector Y_i = 1 with a probability that increases with x_i/M (assuming x_i takes a value from 0 to M), and Y_i = 0 otherwise. The mean of {x_1, x_2, …, x_N} can be estimated as

\"\"<\/p>\n

The data collection algorithm A above satisfies ε-local differential privacy (ε-LDP) because, for any two values x, x′ ∈ [0, M] and any b ∈ {0, 1}, we have Pr[A(x) = b] ≤ e^ε · Pr[A(x′) = b].
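For concreteness, here is a minimal Python sketch of this 1-bit mechanism and the corresponding mean estimator. The function names and the simulation at the bottom are our own illustration rather than code from the paper; the response probability and the estimator follow the formulas above.

```python
import math
import random

def one_bit_response(x, M, eps):
    """Encode a private value x in [0, M] as a single bit under eps-LDP.

    Responds 1 with probability 1/(e^eps + 1) + (x/M) * (e^eps - 1)/(e^eps + 1),
    which grows linearly from 1/(e^eps + 1) at x = 0 to e^eps/(e^eps + 1) at x = M,
    so the probabilities for any two inputs differ by a factor of at most e^eps.
    """
    e = math.exp(eps)
    p = 1.0 / (e + 1) + (x / M) * (e - 1) / (e + 1)
    return 1 if random.random() < p else 0

def estimate_mean(bits, M, eps):
    """Unbiased estimate of the population mean from the collected bits.

    Each term (y * (e^eps + 1) - 1) / (e^eps - 1) has expectation x_i / M,
    so averaging the terms and scaling by M recovers the mean of the x_i.
    """
    e = math.exp(eps)
    return (M / len(bits)) * sum((y * (e + 1) - 1) / (e - 1) for y in bits)

# Example: 100,000 users with private values in [0, M]. Each reveals only
# one noisy bit, yet the estimate concentrates around the true mean.
M, eps = 24.0, 1.0
values = [random.uniform(0, M) for _ in range(100_000)]
bits = [one_bit_response(x, M, eps) for x in values]
print(estimate_mean(bits, M, eps), sum(values) / len(values))
```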

The ε-LDP algorithm provides strong privacy guarantees for one-round data collection when ε is small, e.g., ε = 1. Note that the smaller ε is, the stronger the privacy guarantee provided, and the larger the error in the resulting mean estimate. However, in practice, telemetry is typically collected continuously. What if, from each user, we want to collect the telemetry value at T consecutive time stamps? We could apply the above algorithm A at each time stamp independently; however, by the sequential composition property of differential privacy, we would end up providing only (T · ε)-LDP to each user. When T = 100, (100 · ε)-LDP is usually too weak to be a reasonable guarantee of privacy.
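To spell out where the (T · ε) factor comes from (a standard composition argument, stated here for this specific mechanism): the T bits are generated independently, so for any fixed outputs b_1, …, b_T the likelihood ratio multiplies across rounds,

\[ \frac{\Pr[A(x)=b_1,\ldots,A(x)=b_T]}{\Pr[A(x')=b_1,\ldots,A(x')=b_T]} \;=\; \prod_{t=1}^{T}\frac{\Pr[A(x)=b_t]}{\Pr[A(x')=b_t]} \;\le\; \left(e^{\varepsilon}\right)^{T} \;=\; e^{T\varepsilon}. \]

For T = 100 and ε = 1, the bound e^100 ≈ 2.7 × 10^43 rules out essentially nothing about the user's value.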

While memoization and instantaneous noise are useful techniques for protecting consistent users in the face of repeated data collection (initially adopted by Erlingsson, et al. in RAPPOR for collecting strings under LDP), their efficacy depends on the fact that individual users take on a constant number of distinct private values. In our problem, private values are real numbers, so it is unreasonable to expect users to take consistent values across multiple rounds of data collection. In reality, users tend to be only approximately consistent: for example, software usage may vary by 30 minutes per day on most days. This leads to the following technical question:

How do we discretize continuous values so that we protect approximately consistent users, while keeping the size of the memoization table small and losing no accuracy due to the discretization step?

Clearly, any deterministic discretization rule will incur some accuracy loss. Moreover, even if one is willing to tolerate errors due to rounding, such a rule would lead to a large memoization table, which may be difficult to implement in practice. We address this by making the rounding rule randomized.

\"\"

Figure 2. Rounding and memoization for repeated data collection<\/em><\/p><\/div>\n

Our memoization scheme relies on α-point rounding, a technique used quite extensively in approximation algorithms for rounding linear programs. In this scheme, at the setup phase, every user draws a value α from a uniform distribution over [0, S]; here S is the bucket size of the discretization. The user memoizes the output of our 1-bit algorithm A for each of the values k · S, where k = 0, 1, 2, …, M/S; each k · S is a boundary between two consecutive buckets. At data collection, if the user holds a value x_i such that k · S < x_i + α ≤ (k + 1) · S, then the user responds based on the value k · S by looking up the memoization table.

Note that in our scheme we also memoize the value of α; hence, if a user is approximately consistent (the user's values lie in a range of width S), then the output of our algorithm either never changes or varies between just two values corresponding to neighboring buckets. Our rounding scheme also satisfies another desirable property: no matter how large the bucket size S is, the discretization step introduces no further accuracy loss. (Intuitively, a value x = k · S + r is rounded up to (k + 1) · S with probability exactly r/S, so the rounded value equals x in expectation over α; since the response probability of A is linear in its input, the mean estimator stays unbiased.)
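Below is a minimal Python sketch of how a client might combine α-point rounding with memoization, reusing the hypothetical one_bit_response function from the sketch above. The class name and structure are our own illustration under the assumptions stated in the comments, not the paper's implementation.

```python
import math
import random

class MemoizedClient:
    """Client-side state for repeated collection of a counter in [0, M].

    At setup, draw a secret offset alpha uniformly from [0, S] and memoize
    one fixed 1-bit response for each bucket boundary 0, S, 2S, ..., M.
    Requires one_bit_response(x, M, eps) from the previous sketch.
    """
    def __init__(self, M, S, eps):
        self.M, self.S = M, S
        self.alpha = random.uniform(0, S)                 # memoized offset
        self.table = {k: one_bit_response(k * S, M, eps)  # memoized bits
                      for k in range(int(M // S) + 1)}

    def respond(self, x):
        # alpha-point rounding: x lands in bucket k when
        # k*S < x + alpha <= (k+1)*S; answer with the memoized bit for k*S.
        k = math.ceil((x + self.alpha) / self.S) - 1
        k = max(0, min(k, int(self.M // self.S)))         # clamp at the edges
        return self.table[k]

# A user whose value drifts within a band of width S sends at most two
# distinct bits across all rounds: those of two neighboring buckets.
client = MemoizedClient(M=24.0, S=1.0, eps=1.0)
print([client.respond(8.2 + random.uniform(-0.3, 0.3)) for _ in range(5)])
```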

The ideas outlined here allow us to use memoization and instantaneous noise when collecting counter data. Our NIPS paper contains many more details, including a discussion of privacy guarantees for inconsistent users, simultaneous collection of multiple counters, and more.
