[Beowulf] ECC support on motherboards?
Jim Lux James.P.Lux at jpl.nasa.gov
Tue May 13 15:00:56 PDT 2008
At 02:16 PM 5/13/2008, Håkon Bugge wrote:
>At 19:17 13.05.2008, Perry E. Metzger wrote:
>>So another question is, how can you reliably test any of this stuff?
>>It isn't like you can reliably induce single bit errors and see if the
>>hardware catches them. (A special memory module that let you test
>>would be a wonderful thing, but I've never even heard of such a thing.)

We in the space hardware business do this all the time (but then, we're not building consumer-priced stuff, by any means). Typically, it means that you provide a way to bypass the EDAC logic on writes or reads (which implies that the syndrome bits need some way to be addressed). You can write without EDAC and then read with it, or vice versa. We've also used dual-port memories, where the second port is for diagnostics. Another approach is error injection at the data lines (there are logic analyzers that can do this).

A bigger issue for most computers is upsets of configuration control bits of one sort or another. Unlike program and data memory, which is often being overwritten regularly anyway, configuration bits tend to get set once at startup/initialization and then never changed. This is a particular problem if the bit controls whether a pin on a device is an input or an output.

>Well, you can trust the HW vs, the firmware.
>Further, for some chipsets it is possible to
>simply stop the memory refresh for some time
>(~1 minute) while the system is idle. After
>this, you enable it again, and you should see
>single and/or double bit errors. This
>enabling/disabling through setpci or other. If
>you do not see errors after this, you can try to explain why...

Maybe, maybe not. I wouldn't want to depend on the non-refreshed behavior of a refreshed part, simply because it's undefined. Not refreshing might lead to bit errors, or it might not (maybe it's internally refreshed, maybe it's MRAM or static RAM masquerading as DRAM).

>Once I wrote tool which examined all settings of
>a particular chipset. That raised numerous questions to the vendor.
>
>Hakon
>
>>I'm doing the planning for a new cluster and the whole thing is
>>remarkably bothersome. You can't easily figure out what motherboards
>>will even pretend to do ECC that easily, you can't easily check once
>>you have a sample motherboard in hand. It isn't even easy to get ECC
>>memory for more modern standards. I'm starting to wonder if doing all
>>calculations twice, once on each of two machines, isn't easier, but it
>>seems utterly wrong to do that...
>>
>>Perry
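
The "write without EDAC, then read with it" test described in the reply above can be sketched as a small software model. This is only an illustration under stated assumptions: the class `EdacMemory`, its methods `write_raw` and `read_raw`, and the choice of a Hamming SECDED code over 8-bit words are all hypothetical and are not taken from the thread or from any real memory controller.

```python
# Toy model of an EDAC-protected memory with a diagnostic bypass path.
# Assumption: 8 data bits protected by 4 Hamming check bits plus one
# overall parity bit (SECDED). Real hardware uses wider words.

DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]  # codeword positions of data bits
PARITY_POS = [1, 2, 4, 8]               # codeword positions of check bits


def _data_syndrome(data):
    """XOR together the codeword positions of all set data bits."""
    s = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            s ^= pos
    return s


def encode(data):
    """Return (check_bits, overall_parity) for an 8-bit value."""
    check = _data_syndrome(data)
    overall = (bin(data).count("1") + bin(check).count("1")) & 1
    return check, overall


def decode(data, check, overall):
    """Return (value, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    syndrome = _data_syndrome(data) ^ check
    ones = bin(data).count("1") + bin(check).count("1") + overall
    parity_ok = (ones & 1) == 0
    if syndrome == 0 and parity_ok:
        return data, "ok"
    if not parity_ok:
        # Odd number of flipped bits: treat as a single-bit error.
        if syndrome in DATA_POS:
            return data ^ (1 << DATA_POS.index(syndrome)), "corrected"
        if syndrome == 0 or syndrome in PARITY_POS:
            return data, "corrected"  # error was in a check/parity bit
    return data, "uncorrectable"      # e.g. a double-bit error


class EdacMemory:
    def __init__(self, size):
        self.data = [0] * size
        self.check = [encode(0)] * size

    def write(self, addr, value):
        """Normal path: store data and freshly computed check bits."""
        self.data[addr] = value & 0xFF
        self.check[addr] = encode(value & 0xFF)

    def write_raw(self, addr, value):
        """Bypass path: store data only, leaving the check bits stale."""
        self.data[addr] = value & 0xFF

    def read(self, addr):
        """Normal path: run the stored word through the decoder."""
        check, overall = self.check[addr]
        return decode(self.data[addr], check, overall)

    def read_raw(self, addr):
        """Bypass path: return the stored data bits uncorrected."""
        return self.data[addr]


if __name__ == "__main__":
    mem = EdacMemory(4)
    mem.write(0, 0xA5)
    mem.write_raw(0, 0xA5 ^ 0x04)   # inject a single-bit error
    print(mem.read(0))
    mem.write_raw(0, 0xA5 ^ 0x05)   # inject a double-bit error
    print(mem.read(0))
```

In this model the bypass write plays the role of the fault injector: the check bits still describe the original word, so a normal read afterwards exercises the correction and detection logic.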
