[Beowulf] cheap PCs this christmas
James.P.Lux at jpl.nasa.gov
Tue Nov 22 22:01:14 PST 2005
At 08:58 PM 11/22/2005, Mark Hahn wrote:
> > Honestly, I never knew that not using ECC RAM on anything besides a
> > nonessential system like a standard desktop configuration was ever an
> > option.
>I find that the use of "nonessential" often indicates rather poor reasoning
>about the risks (and costs) involved. a statistically-grounded approach
>would treat memory size and perhaps activity more than whether something is
>"desktop" or "server".
Indeed. The person who prepares the budgets for your paychecks probably
does that on a desktop machine; likewise the person who's dealing with
your insurance claim, handling your tax return, etc.
It's all a tradeoff between the likelihood of the "bad luck", the
probability of detecting it, and the cost of either fixing it (if detected)
or suffering the consequences of not fixing it.
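
As a rough sketch of that tradeoff, the expected cost can be written as the chance of an error times the cost of whichever outcome follows. The function name, its parameters, and every number below are made up for illustration; none of them are measured values.

```python
def expected_cost(p_error, p_detect, cost_fix, cost_undetected):
    """Expected cost per period of one kind of memory error.

    p_error:         probability the error occurs in the period
    p_detect:        probability it is detected, given that it occurs
    cost_fix:        cost of fixing a detected error
    cost_undetected: cost of the consequences when it goes undetected
    """
    detected = p_error * p_detect * cost_fix
    undetected = p_error * (1.0 - p_detect) * cost_undetected
    return detected + undetected


# Hypothetical inputs: same error rate, very different detection odds.
low_detection = expected_cost(0.01, 0.20, 100.0, 10_000.0)
high_detection = expected_cost(0.01, 0.99, 100.0, 10_000.0)
print(low_detection, high_detection)
```

With inputs like these, the undetected-error term dominates, so the answer hinges on how expensive an unnoticed error really is for the workload.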
I work with systems for which, literally, "failure is not an
option" (actually, we call that criticality=1.. loss of life or
mission). ECC is sometimes a non-starter for this, amazingly, because it's
not reliable enough... you want some other approach that cross checks and
is "fail safe". I also work with systems where there's some sort of spec
that says you can screw up some fraction of the time, and for that, you
analyze whether the increased probability of total failure (because you
have more parts in any sort of ECC, parity, or redundancy scheme) is
outweighed by the decreased probability of uncorrected errors.
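
The "more parts" side of that analysis can be sketched with a simple series model: if every part must work and parts fail independently, the chance that at least one fails grows with the part count. The per-part probability and the 8-versus-9 part counts below are assumptions chosen for illustration only.

```python
def p_any_failure(p_part, n_parts):
    """Probability that at least one of n independent parts fails."""
    return 1.0 - (1.0 - p_part) ** n_parts


# Hypothetical: 8 memory chips without ECC, 9 with the extra check chip.
P_PART = 0.001  # assumed per-part failure probability over some period
without_extra = p_any_failure(P_PART, 8)
with_extra = p_any_failure(P_PART, 9)
print(without_extra, with_extra)
```

The added part raises the hard-failure probability slightly; whether that is worth it depends on how much the scheme lowers the uncorrected error rate.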
>that said, our servers all have ECC. on our current ~500 cpus and ~800GB,
>I'd guess we see O(10) corruptions/year. going to 7500 cores and >14TB,
>(all with ECC) I'm pretty happy not to be risking undetected corruptions.
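
For a sense of scale, here is a back-of-envelope extrapolation of the quoted figures. It assumes the corruption count grows linearly with memory size, takes O(10) as exactly 10, and treats ">14TB" as 14,000 GB; all three are simplifications.

```python
def scaled_events_per_year(observed_per_year, observed_gb, target_gb):
    """Scale an observed event rate linearly with memory size."""
    return observed_per_year * target_gb / observed_gb


# ~10 corruptions/year on ~800 GB, extrapolated to ~14 TB.
estimate = scaled_events_per_year(10, 800, 14_000)
print(estimate)
```

Under those assumptions the larger system would see on the order of a couple hundred events per year.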
But how many of those corruptions would have resulted in an error had they
not been caught? And, would you have been able to deal with the potential
errors at a higher level of abstraction? Say you saved enough money by not
buying those extra chips in the ECC memory, so you could buy more nodes AND
you might run a bit faster (depending on the architecture), so you can run
check cases (in mag tape terms: longitudinal parity as opposed to parallel
Is the cache in your processor ECC? What's the impact on your performance
of cache hit/miss vis a vis ECC and/or bit flips?
This is all really non-trivial..
>still, for some workloads, especially for leaner facilities (lower memory,
>less budget spent on network and storage), I'd certainly want to consider
>non-ECC. I only wish vendors would publish their FIT figures, so we could
>crunch the numbers properly.
But FIT numbers (failures per 1E9 hours of operation) usually only apply to
total failures, not to bit error rates, which depend on a lot of
environmental factors (not just altitude, but also surrounding geology,
temperature). On a number of systems I've worked on, almost all the
observed bit errors were eventually attributed to things like timing
margins and electrical transients (ringing on bus lines, coupling from
other signals, etc.), after extensive analysis of the predicted rate of
radiation-induced bit flips.
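
For reference, converting a FIT figure into something more tangible is simple arithmetic. The sketch below uses the definition given above (failures per 1E9 hours of operation); the FIT value and the device count are hypothetical, not vendor data.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def failures_per_year(fit, n_devices):
    """Expected failures per year for n devices with a given FIT rate."""
    return fit * n_devices * HOURS_PER_YEAR / 1e9


def mtbf_hours(fit):
    """Mean time between failures, in hours, for a single device."""
    return 1e9 / fit


# Hypothetical: 1000 devices rated at 100 FIT each.
print(failures_per_year(100, 1000), mtbf_hours(100))
```

This covers only the hard-failure rate that FIT describes; it says nothing about bit error rates, which is the point made above.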
>more to the point, if you're going to network $300 PCs, ECC should almost
>certainly not be on your xmas list...
Or, rather, you should do ECC by using 11 computers at $300 each rather
than 8 computers at $500 each.
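
The arithmetic behind that comparison, using the prices as given; the variable names are just for illustration, and treating the three extra machines as node-level redundancy is one reading of the suggestion.

```python
# Prices and node counts from the comparison above.
cheap_nodes, cheap_price = 11, 300
ecc_nodes, ecc_price = 8, 500

cheap_total = cheap_nodes * cheap_price
ecc_total = ecc_nodes * ecc_price

# Extra machines available for cross-checking or rerunning results.
spare_nodes = cheap_nodes - ecc_nodes

print(cheap_total, ecc_total, spare_nodes)
```

The cheaper cluster costs less in total and still leaves three machines that can be spent on checking results at a higher level.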
James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109