statistics physics probability error-detection risk-analysis

Cosmic Rays: what is the probability they will affect a program?


Once again I was in a design review and encountered the claim that the probability of a particular scenario was "less than the risk of cosmic rays" affecting the program, and it occurred to me that I didn't have the faintest idea what that probability is.

"Since 2^-128 is 1 out of 340282366920938463463374607431768211456, I think we're justified in taking our chances here, even if these computations are off by a factor of a few billion... We're way more at risk for cosmic rays to screw us up, I believe."

Is this programmer correct? What is the probability of a cosmic ray hitting a computer and affecting the execution of the program?


Solution

  • From Wikipedia:

    Studies by IBM in the 1990s suggest that computers typically experience about one cosmic-ray-induced error per 256 megabytes of RAM per month.[15]

    This means a probability of 3.7 × 10^-9 per byte per month, or 1.4 × 10^-15 per byte per second (taking a month as 30 days). If your program runs for 1 minute and occupies 20 MB of RAM, then the failure probability would be

    1 - (1 - 1.4e-15)^(60 × 20 × 1024²) ≈ 1.8e-6, a.k.a. "5 nines"
    

    Error checking can help limit the consequences of such a failure. Also, as Joe commented, chips have become more compact since then, so the failure rate today could differ from what it was 20 years ago.
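
The arithmetic above can be reproduced with a short Python sketch. It assumes the quoted IBM figure (one error per 256 MB of RAM per month), a 30-day month, and independent, uniformly distributed errors per byte per second. The function names `per_byte_second_rate` and `failure_probability` are made up for this illustration. Because 1 - 1.4e-15 sits very close to the limit of double precision, the sketch uses `math.log1p` and `math.expm1` rather than evaluating the power directly.

```python
import math

MIB = 1024 ** 2
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def per_byte_second_rate(errors_per_month=1.0, ram_bytes=256 * MIB):
    """Error rate per byte per second implied by the quoted IBM figure."""
    return errors_per_month / ram_bytes / SECONDS_PER_MONTH


def failure_probability(ram_bytes, seconds, rate=None):
    """Probability of at least one error, assuming independent byte-seconds."""
    if rate is None:
        rate = per_byte_second_rate()
    trials = ram_bytes * seconds
    # Equivalent to 1 - (1 - rate) ** trials, but numerically stable
    # when rate is close to machine epsilon.
    return -math.expm1(trials * math.log1p(-rate))


# The example from the answer: 20 MB of RAM for 1 minute.
p = failure_probability(20 * MIB, 60)
print(f"failure probability: {p:.1e}")

# How much likelier this is than the 2^-128 scenario from the question.
ratio = p / 2.0 ** -128
print(f"ratio to 2^-128: {ratio:.1e}")
```

Under these assumptions the result lands at roughly 1.8e-6, many orders of magnitude above 2^-128, so a factor of "a few billion" in either direction would not change the comparison.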