[Beowulf] cheap PCs this christmas
Robert G. Brown rgb at phy.duke.edu
Wed Nov 23 06:37:29 PST 2005
On Tue, 22 Nov 2005, Jim Lux wrote:

> But how many of those corruptions would have resulted in an error had they
> not been caught? And, would you have been able to deal with the potential
> errors at a higher level of abstraction? Say you saved enough money by not
> buying those extra chips in the ECC memory, so you could buy more nodes AND
> you might run a bit faster (depending on the architecture), so you can run
> check cases (in mag tape terms: longitudinal parity as opposed to parallel
> parity checks).
>
> Is the cache in your processor ECC? What's the impact on your performance of
> cache hit/miss vis a vis ECC and/or bit flips.
>
> This is all really non-trivial..
>
>
>> still, for some workloads, especially for leaner facilities (lower memory,
>> less budget spent on network and storage), I'd certainly want to consider
>> non-ECC. I only wish vendors would publish their FIT figures, so we could
>> crunch the numbers properly.
>
> But FIT numbers (failures per 1E9 hours of operation) usually only apply to
> total failures, not to bit error rates, which depend on a lot of
> environmental factors (not just altitude, but also surrounding geology,
> temperature). On a number of systems I've worked on, almost all the observed
> bit errors were eventually attributed to things like timing margins and
> electrical transients (ringing on bus lines, coupling from other signals,
> etc.), after extensive analysis of the predicted rate of radiation induced
> bit flips.

What HE said, too. Gee guys, great discussion.

Really non-trivial is very definitely the right answer, and yes it can
be perfectly sensible to buy non-ECC memory for a cluster depending on
its scale and the overall quality of the hardware and infrastructure.

I agree with Jim -- I think that a LOT of memory errors result from e.g.
overheating inside the box, power supply problems, incorrectly set or
just plain buggy BIOS, cheap-ass motherboard hardware. THESE all cost
money to "fix" as well, mind you -- proper cluster design is all about
CBA -- but I'd argue that one GOOD design feature is to use high quality
parts and plug them into a professional-grade cluster infrastructure,
ECC or not on the motherboards.

I'd argue this only after trying it both ways, mind you -- this is
bitter experience speaking here. The HUMAN cost of ONE BAD SERIES of
hardware is "greater than you can possibly imagine". So you also need to
trade off the cost of getting hardware a vendor is willing to support
with 3 year onsite service (where if non-ECC starts throwing errors on
some particular node THEY can just plain replace the node) against the
cost of getting vanilla OTS whiteboxes, self-maintaining them, and
getting ECC memory as ONE of your strategies for keeping them working
sort of when really they're broken.

Plus radiation. Clusters in Denver and Mexico City are exempted from
this, or are trading off ECC costs vs putting the cluster in a nuclear
bombshelter.

   rgb

>
>
>> more to the point, if you're going to network $300 PCs, ECC should almost
>> certainly not be on your xmas list...
>
> Or, rather, you should do ECC by using 11 computers for $300 rather than 8
> computers for $500.
>
>> __
>
> James Lux, P.E.
> Spacecraft Radio Frequency Subsystems Group
> Flight Communications Systems Section
> Jet Propulsion Laboratory, Mail Stop 161-213
> 4800 Oak Grove Drive
> Pasadena CA 91109
> tel: (818)354-2075
> fax: (818)393-6875
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
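
For anyone who wants to "crunch the numbers" as suggested in the thread, here is a rough sketch in Python of the two bits of arithmetic that come up: Jim's 11 nodes at $300 versus 8 nodes at $500, and the FIT definition he quotes (failures per 1E9 hours of operation). The function names and the FIT value are made up for illustration; no vendor figure appears in the thread, and as Jim points out, FIT usually describes total failures rather than bit error rates, so treat the result as a toy estimate at best.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def cluster_cost(nodes: int, price_per_node: float) -> float:
    """Total purchase price of a cluster of identical nodes."""
    return nodes * price_per_node


def expected_failures(fit: float, devices: int, hours: float) -> float:
    """Expected failure count, with FIT taken as failures per 1e9 device-hours."""
    return fit * devices * hours / 1e9


# The two options from the thread.
cheap = cluster_cost(11, 300)  # 11 non-ECC nodes at $300 each
ecc = cluster_cost(8, 500)     # 8 ECC nodes at $500 each
print(f"11 x $300 = ${cheap:.0f}")
print(f" 8 x $500 = ${ecc:.0f}")
print(f"difference = ${ecc - cheap:.0f}")

# HYPOTHETICAL FIT value, chosen only to show the unit conversion.
ASSUMED_FIT = 1000.0
per_year = expected_failures(ASSUMED_FIT, devices=11, hours=HOURS_PER_YEAR)
print(f"expected failures per year across 11 nodes: {per_year:.5f}")
```

With these numbers the 11-node option costs $3300 against $4000 for the 8-node option, which leaves three extra nodes for running check cases. Whether that beats ECC depends on everything else discussed above (infrastructure quality, vendor support, human time), none of which this sketch tries to model.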
