Microsoft Azure hosts a wide variety of services, and Azure has hundreds of network edge locations spread across six continents to host them. These locations host many interactive (latency-sensitive) services that cater to consumer and enterprise clients, covering a broad set of products around productivity, search, communications, and storage. The Azure edge locations are the first stop in the Microsoft network for customers, and because those locations are spread out worldwide, customers everywhere can reach Microsoft with low latency. Hundreds of millions of clients use the services on Azure every single day.
Plenty of prior studies have shown the precipitous fall in user engagement with increasing latency. But we don’t have to read studies to understand how important low latency is in our daily lives—we only need to start a video call with someone on our phones or computers. The importance of low latency and round-trip time (RTT) becomes especially evident when latency is high—the glitches and lag in the audio or video make it impossible to have a natural conversation. In fact, our own past work with Skype has studied the importance of the network for good user experience.
The example above illustrates the importance of low network latency, and it also shows that when there are inevitable slowdowns in the network, the system must be able to identify the problem and recover as quickly as possible. This is where BlameIt comes in. In real time, BlameIt pinpoints which autonomous system (AS) along the path from client to cloud and back to client is responsible for an issue. In our SIGCOMM 2019 paper, “Zooming in on Wide-area Latencies to a Global Cloud Provider,” we show how BlameIt identifies these faulty ASes. The work is the result of a multi-year collaboration between Microsoft Research and Azure Networking.
Understanding client-cloud latency and the role of autonomous systems
Users reach Azure either via direct peering or via multiple intermediate autonomous systems, sometimes colloquially referred to as “hops.” The spread of Azure locations means that there are typically few ASes between the client AS and the cloud. Even so, when the latency experienced by Azure clients increases (that is, it breaches the RTT targets), we want to know which AS is the culprit, or more precisely, which AS in the path is causing the latency degradation.
Localizing the faulty AS is crucial! There could be many reasons for RTT degradation, from overloaded cloud servers and congested cloud networks to maintenance issues in the client’s ISP or path updates inside a transit AS. A tool that localizes the faulty AS as quickly as possible helps Azure’s operators trigger remedial actions, such as switching egress peers, thereby minimizing the duration of user impact. But the key is accurate and timely localization of the faulty AS in the client-cloud path. It is important to note that with the rapid improvement in the performance of intra-datacenter and inter-datacenter networks over the years, client-cloud communication over the public Internet has become the weak link for cloud services.
Hold on, hasn’t the research community been at this problem for a while?
Yes indeed! The problem of breaking down end-to-end Internet performance into the contributions of each AS has been of long-standing interest to the Internet measurement community. Prior solutions can be bucketed into passive and active categories (see Figure 2). At the core of both categories lies the same challenge: the Internet’s autonomous structure, which no single entity controls.
Passive techniques rely on using end-to-end latency measurements and solving for the latency contributions of individual ASes (for example, network tomography using linear equations), but in the wild they often run into under-constrained equations due to insufficient coverage of all the Internet paths. Another class of techniques relies on continually issuing active probes from vantage points in the Internet, but that simply incurs too much measurement overhead.
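To make the under-constrained problem concrete, here is a minimal sketch (our own illustration, not from the paper) that sets up tomography-style linear equations in Python: each observed end-to-end RTT is modeled as the sum of per-AS contributions, and with far fewer observed paths than ASes, the system has no unique solution.

```python
# A minimal sketch of why passive network tomography is often under-constrained:
# with only end-to-end RTTs, per-AS latency contributions cannot be uniquely solved.
import numpy as np

# Hypothetical ASes along the measured paths.
ases = ["cloud-AS", "transit-A", "transit-B", "client-AS1", "client-AS2"]

# Each row is one observed path (1 = path traverses that AS); each end-to-end
# RTT is the sum of the contributions of the ASes it crosses.
paths = np.array([
    [1, 1, 0, 1, 0],   # cloud -> transit-A -> client-AS1
    [1, 0, 1, 0, 1],   # cloud -> transit-B -> client-AS2
])
rtts = np.array([80.0, 120.0])  # observed end-to-end RTTs (ms)

# 2 equations, 5 unknowns: infinitely many per-AS breakdowns fit the data.
solution, residuals, rank, _ = np.linalg.lstsq(paths, rtts, rcond=None)
print(f"rank {rank} < {len(ases)} unknowns -> under-constrained")
print(dict(zip(ases, solution.round(1))))
```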
BlameIt: Employing a hybrid two-phased blame assignment
In BlameIt we adopt a two-phased approach, combining the best parts of passive analysis (low measurement overhead) and active probing (fine-grained fault localization).
Phase 1: While the passively collected TCP handshake data alone is not sufficient to pinpoint the faulty AS, it is enough for BlameIt to narrow the fault down to the “cloud,” “middle,” or “client” segment of the path.
Phase 2: Only when the fault lies in one of the middle ASes does BlameIt issue active traceroutes from the Azure Front Door locations, and only for high-priority latency degradations.
How does BlameIt avoid the pitfalls of prior solutions? At a high level, BlameIt leverages two main empirical characteristics. First, usually only one AS in the path is at fault; multiple ASes rarely fail simultaneously. Second, a smaller “failure set” is more likely than a larger one. For example, if all clients connecting to a cloud location see bad RTTs, it is likely the cloud’s fault (and not due to all the clients being bad simultaneously).
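To make the two-phase idea concrete, here is a simplified Python sketch of the Phase-1 coarse blame assignment under the two priors above. The function name, the input format, and the thresholds are our own illustrative assumptions, not Azure’s production logic.

```python
# A simplified sketch of the Phase-1 idea (not the production logic): given
# passively measured RTTs that have breached their targets, prefer the smallest
# explanation -- the cloud end first, then the client AS, otherwise the "middle".
from collections import defaultdict

def coarse_blame(bad_measurements, all_measurements,
                 cloud_threshold=0.8, client_threshold=0.8):
    """bad/all_measurements: lists of (cloud_location, client_as) pairs.
    Returns a coarse verdict ('cloud', 'client', or 'middle') for each bad pair.
    The 0.8 thresholds are assumptions for illustration only."""
    bad_by_cloud, all_by_cloud = defaultdict(set), defaultdict(set)
    bad_by_client, all_by_client = defaultdict(set), defaultdict(set)
    for loc, cas in all_measurements:
        all_by_cloud[loc].add(cas)
        all_by_client[cas].add(loc)
    for loc, cas in bad_measurements:
        bad_by_cloud[loc].add(cas)
        bad_by_client[cas].add(loc)

    verdicts = {}
    for loc, cas in bad_measurements:
        cloud_frac = len(bad_by_cloud[loc]) / len(all_by_cloud[loc])
        client_frac = len(bad_by_client[cas]) / len(all_by_client[cas])
        if cloud_frac >= cloud_threshold:
            verdicts[(loc, cas)] = "cloud"    # most client ASes at this location are bad
        elif client_frac >= client_threshold:
            verdicts[(loc, cas)] = "client"   # this client AS is bad at most locations
        else:
            verdicts[(loc, cas)] = "middle"   # neither endpoint explains it -> Phase 2
    return verdicts
```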
BlameIt is working in Azure right now!
Azure’s cloud locations generate RTT streams that are continuously collected and aggregated at an analytics cluster where the BlameIt script runs periodically. Its outputs trigger prioritized alerts to operators and targeted traceroutes to clients. The detailed outputs of BlameIt are also provided to the operators for ease of investigation.
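Continuing the simplified sketch above, a hypothetical periodic job might look like the following: score the latest RTT measurements against their targets, compute coarse verdicts with coarse_blame, alert operators, and issue targeted traceroutes only for high-impact “middle” cases. The helper names and the priority cutoff are illustrative assumptions rather than details of Azure’s actual pipeline.

```python
# A hypothetical driver for the periodic job (reuses coarse_blame from the
# earlier sketch). Helper names and the priority cutoff are illustrative.
def issue_traceroute(loc, client_as):
    print(f"traceroute from {loc} toward {client_as}")       # stand-in for Phase 2

def alert_operators(loc, client_as, segment):
    print(f"ALERT {loc}/{client_as}: blamed segment = {segment}")

def run_blameit_cycle(measurements, rtt_targets, impacted_clients,
                      priority_cutoff=10_000):
    """measurements: (cloud_location, client_as, rtt_ms) tuples;
    rtt_targets: per-location RTT targets in ms;
    impacted_clients: client counts per (location, client_as) pair."""
    all_pairs = [(loc, cas) for loc, cas, _ in measurements]
    bad_pairs = [(loc, cas) for loc, cas, rtt in measurements
                 if rtt > rtt_targets.get(loc, float("inf"))]
    for pair, segment in coarse_blame(bad_pairs, all_pairs).items():
        alert_operators(*pair, segment)
        # Targeted active probing only for high-impact "middle" verdicts.
        if segment == "middle" and impacted_clients.get(pair, 0) >= priority_cutoff:
            issue_traceroute(*pair)
```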
We compare BlameIt’s results (the blamed AS) against 88 incidents that had manual reports and find that it correctly localizes the fault in every one of them! In addition, BlameIt’s selective active probing results in nearly two orders of magnitude fewer traceroutes than a solution that relies on active probing alone.
With BlameIt currently in action, we hope to move beyond latency and start identifying the culprit AS for network packet loss and bandwidth degradation as well. We also want to investigate the impact of the higher-level service on BlameIt’s functioning and tune it accordingly.