
How to calculate network system downtime


Here are two systems, A and B. How do I calculate the downtime of each?

For A, should it be: 0.01 * 10 * 6 * 12 = 7.2 hours/year?

System A has 10 physical nodes; if any one of those nodes fails, the whole system goes down. The probability of failure for an individual node is 1% per month, and each failure takes 6 hours to fix. What is the downtime for the whole system per year?

System B has 10 physical nodes; as long as 9 out of 10 nodes are running, the whole system functions as normal. The probability of failure for an individual node is 1% per month, and each failure takes 6 hours to fix. What is the downtime for the whole system per year?


Solution

  • We are talking about expected downtimes here, so we'll have to take a probabilistic approach.

    We can take a Poisson approach to this problem. The expected failure rate is 1% per month for a single node, so for 10 nodes over 12 months we expect 0.01 * 10 * 12 = 1.2 failures per year. So you are correct that 1.2 failures/year * 6 hours/failure = 7.2 hours/year is the expected value for A.
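The arithmetic for A can be checked with a few lines of Python. This is only a sketch: the constant and function names are made up for illustration, and it assumes node failures are independent.

```python
# Expected yearly downtime for system A (any single node failure takes it down).
# All names here are illustrative; the numbers come from the question.
NODES = 10                # physical nodes in the system
P_FAIL_PER_MONTH = 0.01   # failure probability per node per month
MONTHS_PER_YEAR = 12
REPAIR_HOURS = 6          # downtime caused by each failure


def expected_failures_per_year(nodes=NODES, p=P_FAIL_PER_MONTH, months=MONTHS_PER_YEAR):
    """Expected number of node failures per year, assuming independent failures."""
    return nodes * p * months


def expected_downtime_a(repair_hours=REPAIR_HOURS):
    """Expected hours of downtime per year for system A."""
    return expected_failures_per_year() * repair_hours


print(expected_failures_per_year())  # about 1.2 failures/year
print(expected_downtime_a())         # about 7.2 hours/year
```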

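    You can figure out how likely a given amount of downtime is by using 7.2 as the lambda value for the Poisson distribution.

    Using R: ppois(6, lambda=7.2) = 0.42, meaning there is roughly a 42% chance that you will have 6 hours or less of downtime in a year. (ppois returns the cumulative probability P(X <= 6), and this treats the hours of downtime themselves as Poisson-distributed, which is an approximation.)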
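For readers without R, the same cumulative probability can be reproduced with the Python standard library. This is a sketch: `poisson_cdf` is a hypothetical helper written here to mirror what `ppois` computes. The second half is an alternative view that is not part of the answer above: it counts failures (lambda = 1.2 per year) instead of hours, since downtime arrives in 6-hour blocks.

```python
import math


def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), the quantity R's ppois(k, lam) returns."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


# The figure quoted above: lambda = 7.2, evaluated at 6.
print(round(poisson_cdf(6, 7.2), 2))  # 0.42

# Alternative view (an assumption, not from the answer): count failures per
# year with lambda = 1.2 instead of treating hours as the Poisson variable.
p_no_failure = poisson_cdf(0, 1.2)    # probability of a year with no outage
p_at_most_one = poisson_cdf(1, 1.2)   # probability of at most one 6-hour outage
print(round(p_no_failure, 3), round(p_at_most_one, 3))
```

The two views answer slightly different questions, so their numbers are not expected to match.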

    For B, the Poisson model also applies, but what matters is the probability that a second node fails during the six hours it takes to repair the first one.

    The failure rate (assuming a 30-day month, which has 120 six-hour periods) is 0.01 / 120, or about 0.0083% per six-hour period per node.

    So we look at the chance of two failures within the same six-hour period, times the number of six-hour periods in a year.

    Using R: dpois(2.0, lambda=(0.01/120)) * 365 * 4 = 0.000005069

    0.000005069 * 3 expected hours/failure = 0.0000152 hours, or about 54.75 milliseconds of expected downtime per year. (3 expected hours per failure because the second failure should occur, on average, halfway through the repair of the first.)
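The calculation for B can be reproduced step by step in Python. This is a sketch: `poisson_pmf` is a hypothetical helper mirroring R's `dpois`, and the other names are illustrative. Note that the calculation above uses the single-node rate as lambda. The `nodes` parameter is an added assumption that lets you scale the rate to more nodes and see how sensitive the result is.

```python
import math


def poisson_pmf(k, lam):
    """P(X == k) for X ~ Poisson(lam), the quantity R's dpois(k, lam) returns."""
    return math.exp(-lam) * lam**k / math.factorial(k)


PERIODS_PER_MONTH = 30 * 4   # six-hour periods in a 30-day month (120)
PERIODS_PER_YEAR = 365 * 4   # six-hour periods in a year (1460)
OVERLAP_HOURS = 3            # average overlap of two outages in one period
RATE_PER_PERIOD = 0.01 / PERIODS_PER_MONTH  # per-node rate per six-hour period


def expected_downtime_b(nodes=1):
    """Return (expected double failures per year, expected downtime hours per year).

    nodes=1 reproduces the calculation in the answer; other values are a
    hypothetical variation that scales lambda by the number of nodes.
    """
    lam = nodes * RATE_PER_PERIOD
    double_failures = poisson_pmf(2, lam) * PERIODS_PER_YEAR
    return double_failures, double_failures * OVERLAP_HOURS


events, hours = expected_downtime_b()
print(f"{events:.9f}")                  # 0.000005069
print(f"{hours * 3600 * 1000:.2f} ms")  # 54.75 ms
```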