erlang reliability uptime downtime

Erlang's 99.9999999% (nine nines) reliability


Erlang was reported to have been used in production systems for over 20 years with an uptime percentage of 99.9999999%.

I did the math as follows:

20*365.25*24*60*60*(1 - 0.999999999) == 0.631 s

That means the system had less than one second of downtime over a period of 20 years. I am not trying to challenge the validity of this; I am just curious how a system can be shut down (on purpose or by accident) for only 0.631 seconds. Could anyone who is familiar with large software systems explain this to us? Thank you.
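For anyone who wants to check the arithmetic or try other figures, here is a minimal Python sketch of the same calculation. The function name `allowed_downtime_seconds` and the constant are just illustrative names, and it assumes a 365.25-day year as in the line above.

```python
# Illustrative helper (hypothetical name): downtime implied by an
# availability figure over a given number of years.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # same year length as the math above


def allowed_downtime_seconds(availability: float, years: float) -> float:
    """Return the total downtime, in seconds, implied by `availability`
    (a fraction such as 0.999999999) over `years` years."""
    return years * SECONDS_PER_YEAR * (1.0 - availability)


if __name__ == "__main__":
    nine_nines = allowed_downtime_seconds(0.999999999, 20)
    five_nines = allowed_downtime_seconds(0.99999, 1)
    print(f"nine nines over 20 years: {nine_nines:.3f} s")
    print(f"five nines over 1 year:   {five_nines / 60:.2f} min")
```

The five-nines line is only there as a sanity check against the commonly quoted figure of roughly five minutes of downtime per year.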


Does anyone know how to calculate the downtime of a service over a cluster of processing units (or machines)?


Solution

  • The reliability figure wasn't meant to measure the total time that any part of the AXD301 (the project in question) was shut down over 20 years. It represents the total time over those 20 years that the service provided by the AXD301 system was offline. It's a subtle difference. As Joe Armstrong says here:

    The AXD301 has achieved a NINE nines reliability (yes, you read that right, 99.9999999%). Let’s put this in context: 5 nines is reckoned to be good (5.2 minutes of downtime/year). 7 nines almost unachievable ... but we did 9.

    Why is this? No shared state, plus a sophisticated error recovery model.

    If you dig a bit deeper, the PhD thesis written by Joe, the original author of Erlang, includes a case study of the AXD301, in which you read:

    One of the projects studied in this chapter is the Ericsson AXD301, a high-performance highly-reliable ATM switch.

    So, as long as the network that the switch was a part of was running without downtime, the author can state "nine nines reliability" for AXD301 (which was all he ever said, avoiding specifics). It doesn't necessarily mean Erlang is the only cause of such high reliability.

    EDIT: In fact, "20 years" itself seems like a misinterpretation. Joe mentions a figure of 20 years in the same article, but it's not actually connected to the nine-nines reliability figure, which potentially came out of a much shorter study (as others have mentioned).
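To illustrate why service downtime can be far smaller than the downtime of any single part, here is a rough Python sketch. It rests on two idealized assumptions that are mine, not from Joe's article or thesis: the units fail independently, and the service counts as up whenever at least one unit is up. The function name and the numbers are hypothetical, and this is not a description of how the AXD301 figure was measured.

```python
# Idealized model (an assumption for illustration only): `units` redundant
# units fail independently, and the service is up if any one of them is up.
def service_unavailability(unit_availability: float, units: int) -> float:
    """Fraction of time the whole service is down, i.e. the probability
    that every unit is down at the same time."""
    return (1.0 - unit_availability) ** units


if __name__ == "__main__":
    # Hypothetical numbers: three units, each with "three nines" availability.
    down = service_unavailability(0.999, 3)
    print(f"service unavailability: {down:.3e}")
```

Under those assumptions, three units at 99.9% each give a service unavailability of about 1e-9, which is the nine-nines range, even though each unit is individually down far more often. Real failures are rarely fully independent, so treat this as an upper bound on what redundancy alone can buy.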