[Beowulf] Not quite Walmart, or, living without ECC?

Robert G. Brown rgb at phy.duke.edu
Fri Nov 16 04:34:57 PST 2007


On Thu, 15 Nov 2007, David Mathog wrote:

> There are some pretty good deals in the low end of the mother board
> and CPU ranges right now.  Not what you folks would buy, but something
> I'd consider to replace the old Athlon MP's in our 2U cases, one of
> which just blew up (or the Tyan motherboard, it hardly matters as I
> don't have spares for either part).  It looks like one can buy

Ah, but I do...

Yessir, genuine 2466s.  Even have a few spare CPUs.  And here I am,
trying to get myself to throw them away and thereby clean up my office,
since at this point one can buy...

> a dual core Athlon64, 1 Gb of memory, 1G Lan, and low end VGA on a 
> consumer motherboard for around $150.  Maybe less. With the recycled

...which is IIRC less than just one of the Athlon CPUs alone cost.  Sigh.

And the 2466 sucks.  Well, sucked.  But if you WANT them and will pay
for shipping and are willing to add to your already extensive beer-debt
on the rewiring of your house, I'd be happy to ship you what I've got,
no guarantees.  CPUs still packaged, motherboards may have been removed
from packaging but I have no reason to think they don't/won't work.

> case, fans, PS, and disks that would be an inexpensive way to more
> than resuscitate the dead node(s).
>
> The one thing that I don't see cheap anywhere is ECC RAM and
> motherboards that support it.
>
> Any of you running clusters without ECC?  Has the lack of error
> correction been a problem?

A very good question.  However, as always with systems, it is one that
is very hard to answer without ECC.  A single byte somewhere in your
system flips a bit.  A 0D turns into an 8D.  If it is in the middle of a
computation, unpredictable things occur.  Maybe the process crashes,
maybe a loop executes a few more times than it should and you get wrong
answers.  Maybe the answers are egregiously, obviously, horribly wrong,
maybe they are subtly wrong, off by a tiny bit.  Maybe the bit is in the
middle of kernelspace and the system dies horribly almost immediately.
Maybe it is in the middle of free memory, or cached library pages, and
nothing happens.
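
To make the 0D -> 8D case concrete, here is a minimal Python sketch of a
single bit flip, first in a byte and then in the 64-bit representation
of a floating point number.  The helper names (flip_bit,
flip_bit_in_double) are made up for illustration; nothing here models
real DRAM, it just shows how much the damage depends on WHICH bit goes.

```python
import struct


def flip_bit(value: int, bit: int) -> int:
    """Return value with one bit inverted (bit 0 = least significant)."""
    return value ^ (1 << bit)


def flip_bit_in_double(x: float, bit: int) -> float:
    """Invert one bit of the 64-bit IEEE 754 representation of x."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


# The byte from the text: flipping the top bit of 0x0D.
print(f"0x0D -> 0x{flip_bit(0x0D, 7):02X}")

# The same kind of flip in a double holding 1.0, at three positions:
# lowest mantissa bit, top exponent bit, and the sign bit.
for bit in (0, 62, 63):
    print(f"1.0 with bit {bit:2d} flipped: {flip_bit_in_double(1.0, bit)!r}")
```

A flip in the low mantissa bits is the "subtly wrong" case; a flip in
the exponent or sign is the "egregiously wrong" one.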

All you see is the symptom, however.  But systems DO crash.  Sometimes
from a bit flip, I suppose.  Sometimes from a deep bug.  Sometimes
because they've reached a level of complexity that makes them as "alive"
and self-willed as, say, a flatworm or ant or something, and with life
comes perversity (do I talk to my computers and try to make them feel
welcome and content?  I do...).  And when they've crashed, well, it's
hard to say why they crashed.  They're crashed, after all.  Sometimes
they are kind enough to print out a message as they crash saying "Oops.
I've just lost my mind.  Please look at the following list of nearly
incomprehensible numbers and then kick me in the head."  I do my duty
and look at those numbers, but rarely am I able to put my finger on byte
23 and say "Aha!  That 8D should be a 0D!  I must have suffered a Bit
Flip!".

Besides, more often it just dies, silently and without reprieve or data
to retrieve.

Not often, though.  Given that laptops don't count -- too much going on
with networking bopping up and down and 2nd Life's buggy client locking
up my entire system with the whiteout screen of death (hadn't seen THAT
for a while) -- I still see Linux on non-ECC systems being awesomely
stable.  Awesomely stable on relatively small collections of boxes,
however, might not translate to awesome stability on 1024-node clusters
-- small numbers on one might become annoying numbers on the other.
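
The scaling argument can be sketched with a little arithmetic.  Assume
(and it is only an assumption) that each node independently has some
small probability p of being hit by a memory-induced failure during a
run; then the chance that at least one of N nodes is hit is
1 - (1 - p)^N.  The value of p below is a made-up placeholder, not a
measured error rate.

```python
def p_any_failure(p_node: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent nodes fails."""
    return 1.0 - (1.0 - p_node) ** nodes


# HYPOTHETICAL per-node failure probability for one run; not a measurement.
P_NODE = 0.001

for n in (1, 16, 128, 1024):
    print(f"{n:5d} nodes: P(at least one failure) = {p_any_failure(P_NODE, n):.3f}")
```

With independent nodes the per-run risk grows roughly like N*p while
N*p is small, which is why a rate you never notice on a handful of
boxes can dominate a tightly coupled job on a big cluster.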

ECC machines do report the errors that they correct, IIRC, at least
sometimes.  I don't know that I trust them, though, as predictors of
non-ECC error rates.  If they were, I'd expect more problems than I
actually see, although I admit that to really figure out my expectation
I'd have to trace down the consequences of flips in all the different
pathways above in a probabilistic way.  Too much work.  Simpler to say
that if I'm buying systems with OPM (Other People's Money) for doing
professional work where
bitflips might give me embarrassingly wrong answers, I cheerfully spend
some of the OPM on ECC.  If I'm buying systems for myself, for my
desktop, for my home cluster/network, or on a small (as opposed to
large) chunk of OPM, then I don't worry about it and get consumer-grade
systems or motherboards to pop into the cases I already have.
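
For anyone who wants to look at the corrected-error reports on their own
ECC boxes: on Linux they are usually exposed through the kernel's EDAC
subsystem in sysfs, as per-memory-controller ce_count (corrected) and
ue_count (uncorrected) files.  The sketch below assumes an EDAC driver
for your chipset is loaded and that the counters live under
/sys/devices/system/edac/mc; if not, it simply returns nothing.  The
function name read_edac_counts is made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the EDAC memory-controller counters.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict:
    """Return {controller: {"ce_count": int|None, "ue_count": int|None}}."""
    counts = {}
    if not root.is_dir():
        return counts  # no EDAC driver loaded, or not an ECC system
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None  # counter missing or unreadable
        counts[mc.name] = entry
    return counts


for controller, entry in read_edac_counts().items():
    print(controller, "corrected:", entry["ce_count"],
          "uncorrected:", entry["ue_count"])
```

Run from cron on every node and logged, this gives at least a rough
picture of how often ECC is actually earning its keep.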

   rgb

>
> Thanks,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
>

-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977


