[Beowulf] cheap PCs this Christmas

Robert G. Brown rgb at phy.duke.edu
Wed Nov 23 06:37:29 PST 2005


On Tue, 22 Nov 2005, Jim Lux wrote:

> But how many of those corruptions would have resulted in an error had they
> not been caught?  And would you have been able to deal with the potential
> errors at a higher level of abstraction?  Say you saved enough money by not
> buying the extra chips ECC memory requires: you could buy more nodes AND
> you might run a bit faster (depending on the architecture), so you could
> run check cases (in mag-tape terms: longitudinal parity as opposed to
> parallel parity checks).
>
> Is the cache in your processor ECC?  What's the impact on your performance
> of cache hits/misses vis-à-vis ECC and/or bit flips?
>
> This is all really non-trivial...
>
>
>> still, for some workloads, especially for leaner facilities (lower memory,
>> less budget spent on network and storage), I'd certainly want to consider
>> non-ECC.  I only wish vendors would publish their FIT figures, so we could
>> crunch the numbers properly.
>
> But FIT numbers (failures per 1E9 hours of operation) usually only apply
> to total failures, not to bit error rates, which depend on a lot of
> environmental factors (not just altitude, but also surrounding geology and
> temperature).  On a number of systems I've worked on, almost all of the
> observed bit errors were eventually attributed -- after extensive analysis
> of the predicted rate of radiation-induced bit flips -- to things like
> timing margins and electrical transients (ringing on bus lines, coupling
> from other signals, etc.).

What HE said, too.  Gee guys, great discussion.  "Really non-trivial" is
very definitely the right answer, and yes, it can be perfectly sensible
to buy non-ECC memory for a cluster depending on its scale and the
overall quality of the hardware and infrastructure.  I agree with Jim --
I think that a LOT of memory errors result from e.g. overheating inside
the box, power supply problems, an incorrectly set or just plain buggy
BIOS, and cheap-ass motherboard hardware.  THESE all cost money to "fix"
as well, mind you -- proper cluster design is all about CBA
(cost-benefit analysis) -- but I'd argue that one GOOD design feature is
to use high-quality parts and plug them into a professional-grade
cluster infrastructure, ECC or not on the motherboards.  I'd argue this
only after trying it both ways, mind you -- this is bitter experience
speaking here.  The HUMAN cost of ONE BAD SERIES of hardware is "greater
than you can possibly imagine".  So you also need to trade off the cost
of getting hardware a vendor is willing to support with three-year
onsite service (where, if non-ECC memory starts throwing errors on some
particular node, THEY can just plain replace the node) against the cost
of getting vanilla off-the-shelf whiteboxes, maintaining them yourself,
and using ECC memory as ONE of your strategies for keeping them sort of
working when really they're broken.
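
And apropos of wishing vendors published FIT figures: if they ever
did, the crunching itself would be the easy part.  A back-of-envelope
in Python -- and note that EVERY number below is invented for
illustration, since the real rates are exactly what we don't have:

  # Back-of-envelope soft-error estimate.  The FIT rate here is an
  # ASSUMED placeholder, not a published vendor figure.
  fit_per_mbit   = 100.0    # assumed failures per 1e9 device-hours/Mbit
  mem_gb         = 2.0      # memory per node
  nodes          = 64       # cluster size
  hours_per_year = 24 * 365

  mbits = mem_gb * 1024 * 8                # GB -> Mbit
  flips = fit_per_mbit * mbits * nodes * hours_per_year / 1e9
  print("expected bit flips/year across the cluster: %.0f" % flips)

With those made-up numbers you'd expect order a thousand flips a year
across the cluster -- many of them, per Jim's first question above, in
memory nobody happens to be reading at the time.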

Plus radiation.  Clusters at altitude -- Denver, Mexico City -- are an
exception here: there the radiation argument is real, and you're trading
off the cost of ECC against putting the cluster in a nuclear bomb
shelter.
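
To make Jim's "check cases" idea concrete: if you bank the ECC savings
and buy extra nodes, the error detection can move up into the job
layer.  A toy sketch in Python -- run_job() is a made-up placeholder
for whatever actually dispatches work in your setup, NOT anybody's
real API:

  # Software-level "ECC": run each job on two cheap non-ECC nodes and
  # compare.  A silent bit flip on one node shows up as a disagreement.

  def run_job(node, job):
      # Placeholder so the sketch runs; really this would queue the
      # job out to the named node.
      return sum(job)

  def checked_run(job, node_a, node_b):
      r1 = run_job(node_a, job)
      r2 = run_job(node_b, job)
      if r1 != r2:
          # Rerun, or send it to a third node and take the majority.
          raise RuntimeError("nodes %s and %s disagree"
                             % (node_a, node_b))
      return r1

  print(checked_run(range(1000), "node01", "node02"))

Every job costs you 2x the cycles this way, which is exactly why the
extra nodes have to come out of the ECC savings for the arithmetic to
work.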

   rgb

>
>
>> more to the point, if you're going to network $300 PCs, ECC should almost
>> certainly not be on your xmas list...
>
> Or, rather, you should do ECC by using 11 computers at $300 each rather
> than 8 computers at $500 each.
>
>
> James Lux, P.E.
> Spacecraft Radio Frequency Subsystems Group
> Flight Communications Systems Section
> Jet Propulsion Laboratory, Mail Stop 161-213
> 4800 Oak Grove Drive
> Pasadena CA 91109
> tel: (818)354-2075
> fax: (818)393-6875
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf
>

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email: rgb at phy.duke.edu