[Beowulf] cheap PCs this christmas

Jim Lux James.P.Lux at jpl.nasa.gov
Tue Nov 22 22:01:14 PST 2005

At 08:58 PM 11/22/2005, Mark Hahn wrote:

> > Honestly, I never knew that not using ECC RAM on anything besides a
> > nonessential system like a standard desktop configuration was ever an
> > option.
>I find that the use of "nonessential" often indicates rather poor reasoning
>about the risks (and costs) involved.  a statistically-grounded approach
>would treat memory size and perhaps activity more than whether something is
>"desktop" or "server".

Indeed. The person who prepares the budgets for your paychecks probably 
does that on a desktop machine; so does the person who's dealing with 
your insurance claim, handling your tax return, etc.

It's all a tradeoff between the likelihood of the "bad luck", the 
probability of detecting it, and the cost of either fixing it (if detected) 
or suffering the consequences of not fixing it.
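
A rough sketch of that tradeoff as an expected-cost calculation. The function name and every number below are hypothetical, chosen only to show the shape of the arithmetic, not taken from any real system:

```python
def expected_cost(p_error, p_detect, cost_fix, cost_undetected):
    """Expected cost per run, given that an error occurs with probability p_error.

    A detected error costs cost_fix; an undetected one costs cost_undetected.
    """
    detected = p_error * p_detect * cost_fix
    undetected = p_error * (1.0 - p_detect) * cost_undetected
    return detected + undetected


# Hypothetical numbers, for illustration only.
no_detection = expected_cost(p_error=1e-3, p_detect=0.0,
                             cost_fix=10.0, cost_undetected=10_000.0)
with_detection = expected_cost(p_error=1e-3, p_detect=0.99,
                               cost_fix=10.0, cost_undetected=10_000.0)
print(no_detection, with_detection)
```

Whether detection hardware is worth buying then comes down to comparing the difference between those two figures with what the hardware costs.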

I work with systems for which, literally, "failure is not an 
option"  (actually, we call that criticality=1: loss of life or 
mission).  Amazingly, ECC is sometimes a non-starter for this, because it's 
not reliable enough; you want some other approach that cross-checks and 
is "fail safe".  I also work with systems where there's some sort of spec 
that says you can screw up some fraction of the time. For those, you 
analyze whether the increased probability of total failure 
(because you have more parts in any sort of ECC, parity, or redundancy 
scheme) is outweighed by the decreased probability of uncorrected errors.
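
A minimal sketch of that analysis, assuming independent part failures and treating a hard failure and an uncorrected error as equally fatal. The part counts and probabilities are invented for illustration:

```python
def p_any_part_fails(n_parts, p_part):
    """Probability that at least one of n independent parts fails."""
    return 1.0 - (1.0 - p_part) ** n_parts


def p_loss(n_parts, p_part, p_uncorrected):
    """Probability of losing the run to a hard failure or an uncorrected error.

    Assumes the two causes are independent (a simplification).
    """
    p_hard = p_any_part_fails(n_parts, p_part)
    return 1.0 - (1.0 - p_hard) * (1.0 - p_uncorrected)


# Hypothetical: the protected design has one extra part but far fewer
# uncorrected errors.
reliable_parts_plain = p_loss(n_parts=8, p_part=1e-4, p_uncorrected=1e-3)
reliable_parts_protected = p_loss(n_parts=9, p_part=1e-4, p_uncorrected=1e-6)

flaky_parts_plain = p_loss(n_parts=8, p_part=1e-2, p_uncorrected=1e-3)
flaky_parts_protected = p_loss(n_parts=9, p_part=1e-2, p_uncorrected=1e-6)

print(reliable_parts_plain, reliable_parts_protected)
print(flaky_parts_plain, flaky_parts_protected)
```

With reliable parts the extra protection wins; with flaky parts the added part count can make the protected design worse overall, which is the point of doing the analysis rather than assuming.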

>that said, our servers all have ECC.  on our current ~500 cpus and ~800GB,
>I'd guess we see O(10) corruptions/year.  going to 7500 cores and >14TB,
>(all with ECC) I'm pretty happy not to be risking undetected corruptions.

But how many of those corruptions would have resulted in an error had they 
not been caught?  And would you have been able to deal with the potential 
errors at a higher level of abstraction? Say you saved enough money by not 
buying those extra chips in the ECC memory that you could buy more nodes, AND 
you might run a bit faster (depending on the architecture), so you could 
run check cases (in mag tape terms: longitudinal parity as opposed to 
parallel parity checks).
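
To illustrate the mag tape analogy: a per-word ("parallel") parity bit is stored alongside every word, while a longitudinal check is a single check word computed over a whole block. The sketch below is one reading of that analogy, with hypothetical function names and data, assuming non-negative integer words:

```python
from functools import reduce
from operator import xor


def word_parity(word):
    """Even-parity bit for one non-negative word (the per-word check)."""
    return bin(word).count("1") & 1


def longitudinal_parity(block):
    """XOR of all words in a block: one check word for the whole block."""
    return reduce(xor, block, 0)


block = [0b10110010, 0b01101100, 0b11110000]
check_word = longitudinal_parity(block)

# Flip a single bit in one word and see that both checks notice.
corrupted = list(block)
corrupted[1] ^= 0b00000100

per_word_caught = word_parity(corrupted[1]) != word_parity(block[1])
block_caught = longitudinal_parity(corrupted) != check_word
print(per_word_caught, block_caught)
```

The per-word check costs storage on every word; the block check costs one extra word plus the time to recompute it, which is closer in spirit to running a check case after the fact.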

Is the cache in your processor ECC?  What's the impact on your performance 
of cache hit/miss vis-a-vis ECC and/or bit flips?

This is all really non-trivial.

>still, for some workloads, especially for leaner facilities (lower memory,
>less budget spent on network and storage), I'd certainly want to consider
>non-ECC.  I only wish vendors would publish their FIT figures, so we could
>crunch the numbers properly.

But FIT numbers (failures per 1E9 hours of operation) usually only apply to 
total failures, not to bit error rates, which depend on a lot of 
environmental factors (not just altitude, but also surrounding geology, 
temperature).  On a number of systems I've worked on, extensive analysis 
went into predicting the rate of radiation-induced bit flips, yet almost 
all the observed bit errors were eventually attributed to things like 
timing margins and electrical transients (ringing on bus lines, coupling 
from other signals, etc.).
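
For reference, converting a FIT figure into an expected failure count is simple arithmetic. The sketch below uses the definition given above (failures per 1E9 hours of operation); the FIT value and device count are made up, since no vendor figures appear in this thread:

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def expected_failures_per_year(fit, n_devices):
    """Expected failures per year for n_devices parts, each rated at `fit`.

    FIT = failures per 1e9 device-hours of operation.
    """
    return fit * n_devices * HOURS_PER_YEAR / 1e9


# Hypothetical: 1000 devices rated at 100 FIT each.
print(expected_failures_per_year(fit=100, n_devices=1000))
```

As noted above, this only covers whatever the FIT figure was measured for (usually total failures), not environment-dependent bit error rates.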

>more to the point, if you're going to network $300 PCs, ECC should almost
>certainly not be on your xmas list...

Or, rather, you should do ECC by using 11 computers at $300 each rather 
than 8 computers at $500 each.
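
The arithmetic behind that comparison, assuming the quoted prices are per machine (the helper name is hypothetical):

```python
def cluster_cost(nodes, price_each):
    """Total purchase price for a cluster of identical nodes."""
    return nodes * price_each


cheap_cluster = cluster_cost(11, 300)   # non-ECC machines
ecc_cluster = cluster_cost(8, 500)      # ECC machines

extra_nodes = 11 - 8
savings = ecc_cluster - cheap_cluster
print(cheap_cluster, ecc_cluster, extra_nodes, savings)  # 3300 4000 3 700
```

Under that assumption the cheaper cluster has three more nodes available for redundant or check runs and still costs $700 less.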


James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875
