[Beowulf] ECC support on motherboards?
landman at scalableinformatics.com
Tue May 13 15:45:45 PDT 2008
Perry E. Metzger wrote:
> Håkon Bugge <Hakon.Bugge at scali.com> writes:
>> It's even worse. On one mtbd, the BIOS had a menu for enabling ECC; I
>> did. But reading the register from the chipset revealed nothing was
>> actually enabled in the hardware. You have to be paranoid in this
>> business. This was a "bleeding edge" mtbd, with a low revision BIOS of
>> course. The fu being that a car manufacturer ran a cluster of these
>> for several months doing crash worthiness simulations ...
> So another question is, how can you reliably test any of this stuff?
> It isn't like you can reliably induce single bit errors and see if the
> hardware catches them. (A special memory module that let you test
> would be a wonderful thing, but I've never even heard of such a thing.)
.... actually ... you can. Run your code, and have it beat on RAM. We
Some folks use memtest* and variants, and it catches some base errors.
But it doesn't exercise things the way the application does. So we use
a number of GAMESS runs and other large-RAM jobs. That beats the heck out
of the unit. If it starts tossing MCE errors, we get a very good
indication that there is a real memory issue.
And, for those doubters, yes, we have caught errors with this that
memtest* did not catch. And yes, we could reliably reproduce them.
All our systems, regardless of their function, run with these tests
specifically to try to force MCE errors.
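A minimal sketch of the "beat on RAM and watch for errors" idea, not the GAMESS-based procedure itself. It assumes a Linux box with an EDAC driver loaded, so that corrected/uncorrected error counters show up under /sys/devices/system/edac/mc. The function names (read_edac_counts, hammer, run) and the buffer size are made up for illustration. Note that with working ECC a single-bit error is corrected silently, so the read-back compare can stay clean while the corrected-error counter climbs.

```python
import glob
import os

# Linux EDAC sysfs root (assumption: an EDAC driver for the chipset is loaded).
EDAC_ROOT = "/sys/devices/system/edac/mc"


def read_edac_counts(root=EDAC_ROOT):
    """Sum corrected (ce) and uncorrected (ue) error counts over all controllers."""
    totals = {"ce": 0, "ue": 0}
    for kind in totals:
        for path in glob.glob(os.path.join(root, "mc*", kind + "_count")):
            with open(path) as f:
                totals[kind] += int(f.read().strip())
    return totals


def hammer(size_bytes, passes=1):
    """Write bit patterns into a buffer, read them back, count mismatched bytes.

    The buffer should be much larger than the CPU caches, otherwise the
    traffic never reaches the DIMMs.
    """
    buf = bytearray(size_bytes)
    mismatches = 0
    for _ in range(passes):
        for pattern in (0x55, 0xAA, 0x00, 0xFF):
            expected = bytes([pattern]) * size_bytes
            buf[:] = expected            # write the pattern
            if buf != expected:          # read it back and compare
                mismatches += sum(a != b for a, b in zip(buf, expected))
    return mismatches


def run(size_bytes=256 * 1024 * 1024, passes=4, root=EDAC_ROOT):
    """Hammer memory and report new errors logged while the test ran."""
    before = read_edac_counts(root)
    mismatches = hammer(size_bytes, passes)
    after = read_edac_counts(root)
    return {
        "mismatches": mismatches,
        "new_ce": after["ce"] - before["ce"],
        "new_ue": after["ue"] - before["ue"],
    }


if __name__ == "__main__":
    print(run())
```

A toy like this is closer to memtest* than to a real application load, so it only illustrates the bookkeeping: snapshot the counters, run the workload, compare. The same before/after check can wrap any real job.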
> I'm doing the planning for a new cluster and the whole thing is
> remarkably bothersome. You can't easily figure out what motherboards
> will even pretend to do ECC that easily, you can't easily check once
> you have a sample motherboard in hand. It isn't even easy to get ECC
> memory for more modern standards. I'm starting to wonder if doing all
> calculations twice, once on each of two machines, isn't easier, but it
> seems utterly wrong to do that...
Hmmm.... sounds to me like you probably need to work with groups that
have done this and do this for a living (deliver working systems to
customers, and help them figure out what they need to do). Bug Don
Becker and his team (Penguin), and a bunch of others hanging around here
(and us if you like).
It actually is not hard to build a system with ECC capability. Most
vendors, the vast majority of them, leave the BIOS settings at their
defaults and assume they are "good enough". We don't normally advise that.
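Since the BIOS setting alone cannot be trusted, one way to cross-check from the OS side is to ask the kernel what ECC mode it sees. This is a rough sketch, assuming Linux with an EDAC driver for the chipset loaded. The edac_mode / dimm_edac_mode attribute names are taken from the kernel's EDAC sysfs layout. The helper names (ecc_modes, ecc_looks_active) are hypothetical, and treating "None" and "Unknown" as "not active" is an assumption about the strings the driver reports.

```python
import glob
import os

# Linux EDAC sysfs root (assumption: an EDAC driver for the chipset is loaded).
EDAC_ROOT = "/sys/devices/system/edac/mc"


def ecc_modes(root=EDAC_ROOT):
    """Map each csrow/dimm directory to the ECC mode string the driver reports."""
    modes = {}
    for name in ("edac_mode", "dimm_edac_mode"):
        for path in glob.glob(os.path.join(root, "mc*", "*", name)):
            key = os.path.relpath(os.path.dirname(path), root)
            with open(path) as f:
                modes[key] = f.read().strip()
    return modes


def ecc_looks_active(modes):
    """True only if at least one entry exists and none reports no/unknown ECC.

    An empty dict means no EDAC information was found, which proves nothing
    either way, so it is reported as not confirmed.
    """
    if not modes:
        return False
    return all(m.lower() not in ("", "none", "unknown") for m in modes.values())


if __name__ == "__main__":
    found = ecc_modes()
    for where, mode in sorted(found.items()):
        print(where, mode)
    print("ECC confirmed by EDAC:", ecc_looks_active(found))
```

This only shows what the driver believes. It is a sanity check on the BIOS setting, not a substitute for load testing.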
Greg L or someone spoke about scrubbing. You can enable that. It is
generally a good idea (we recommend it). Yeah, it does eat memory
bandwidth. And it does slow down access to RAM. There is a cost for
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax : +1 866 888 3112
cell : +1 734 612 4615