Empirical Measurements of Disk Failure Rates and Error Rates
The SATA advertised bit error rate of one error in 10 terabytes is frightening. We moved 2 PB through low-cost hardware and saw five disk read error events, several controller failures, and many system reboots caused by security patches. We conclude that SATA uncorrectable read errors are not yet a dominant system-fault source they happen, but are rare compared to other problems. We also conclude that UER (uncorrectable error rate) is not the relevant metric for our needs. When an uncorrectable read error happens, there are typically several damaged storage blocks (and many uncorrectable read errors.) Also, some uncorrectable read errors may be masked by the operating system. The more meaningful metric for data architects is Mean Time To Data Loss (MTTDL.)