Closing the Network Diagnostics Gap with Vigil

  • Behnaz Arzani ,
  • Selim Ciraci ,
  • Luiz Chamon ,
  • Yibo Zhu ,
  • Hongqiang Liu ,
  • ,
  • Geoff Outhred ,
  • Boon Thau Loo

SIGCOMM Posters and Demos '17, Los Angeles, CA, USA

Published by ACM - Association for Computing Machinery

Vigil started with an ambitious goal: for every TCP retransmission in our data centers, we wanted to pinpoint the network link responsible for the packet drop that triggered it, while incurring negligible diagnostic overhead and requiring negligible changes to the networking infrastructure.

This goal may sound like overkill. After all, TCP is designed to tolerate a few packet losses. Packet losses might result from simple congestion rather than network equipment failures. Even network failures might be transient. Above all, there is a danger of drowning in a sea of data without generating any actionable intelligence.