[Beowulf] Tyan S2882
hahn at physics.mcmaster.ca
Thu Sep 28 07:17:27 PDT 2006
> * Dual AMD Opteron DP270 (2.0 GHz)
> * Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB
> ( 12 nodes have 8*2GB)
this dimm is 2-rank, I believe; corsair's datasheet is pretty lame.
that means that each bank of memory is 4x2=8 ranks. that's definitely
pushing the limit; I'm sure it can be done in some cases, but it's definitely
not supported by some revs of the opteron, and will always be pretty
> When a node crashes, we typically see a MCE + kernel panic. We get about
try running mcelog periodically; I bet you see lots of corrected ECC errors.
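
a minimal sketch of how periodically collected output could be tallied per node, assuming each node's mcelog output has already been saved as text. the marker string, the function names and the per-node *.log directory layout are all hypothetical; the exact text mcelog prints varies by version, so adjust the marker to match what you actually see.

```python
from __future__ import annotations

from pathlib import Path

# Assumption: any line containing this marker counts as one logged event.
# The real text depends on the mcelog version; treat this as a placeholder.
MARKER = "MCE"


def count_events(log_text: str, marker: str = MARKER) -> int:
    """Count the lines in one node's saved log that contain the marker."""
    return sum(1 for line in log_text.splitlines() if marker in line)


def rank_nodes(logs: dict[str, str], marker: str = MARKER) -> list[tuple[str, int]]:
    """Return (node, count) pairs sorted with the highest counts first."""
    counts = {node: count_events(text, marker) for node, text in logs.items()}
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


def load_logs(directory: str) -> dict[str, str]:
    """Read every *.log file in a (hypothetical) per-node log directory."""
    return {
        p.stem: p.read_text(errors="replace")
        for p in Path(directory).glob("*.log")
    }
```

calling rank_nodes(load_logs("some/log/dir")) would list the nodes by event count, which makes it easier to check whether the nodes that crash are also the ones logging the most corrected errors.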
> once and ran stable afterwards. Crashes seem to occur mostly when the system
> is under heavy CPU (memory?) load.
> Far too many correctable ECC errors are reported (on a subset of about 10-20
> nodes). Sometimes the ECC errors disappeared after I cyclically interchanged
> the memory modules within one node. There seems to be a weak correlation
> between the instabilities and the tendency to exhibit ECC errors.
IMO, the config is the problem, not the boards, cpus, dimms, etc.
> It seems that the last BIOS upgrade has reduced the ECC error rate
probably made the timing a little looser. does the bios let you tweak?
it would be interesting to know whether derating the clock (->pc2700)
helps this situation more or less than derating the latency.
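
to make the two derating options concrete, here is a rough sketch of the arithmetic. the CAS values used in the example are assumptions for illustration only (the actual timings of these dimms are not given above), and the 8-byte bus width is the standard 64-bit DDR data path.

```python
def cycle_ns(data_rate_mt_s: float) -> float:
    """Clock period in ns. DDR transfers twice per clock, so the clock
    frequency in MHz is half the data rate in MT/s."""
    return 2000.0 / data_rate_mt_s


def cas_ns(data_rate_mt_s: float, cas_cycles: float) -> float:
    """Absolute CAS latency in ns for a given CAS setting in cycles."""
    return cas_cycles * cycle_ns(data_rate_mt_s)


def peak_gb_s(data_rate_mt_s: float, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth of one 64-bit channel in GB/s."""
    return data_rate_mt_s * bus_bytes / 1000.0


if __name__ == "__main__":
    ddr400 = 400.0          # PC3200
    ddr333 = 1000.0 / 3.0   # PC2700
    # Hypothetical CAS settings, chosen only to show the trade-off.
    for name, rate, cl in [
        ("pc3200, CL2.5", ddr400, 2.5),
        ("pc3200, CL3  ", ddr400, 3.0),
        ("pc2700, CL2.5", ddr333, 2.5),
    ]:
        print(f"{name}: cycle {cycle_ns(rate):.2f} ns, "
              f"CAS {cas_ns(rate, cl):.2f} ns, "
              f"peak {peak_gb_s(rate):.2f} GB/s")
```

dropping to pc2700 stretches every cycle from 5 ns to 6 ns, so all cycle-counted timings get more slack but peak bandwidth falls; adding a CAS cycle at pc3200 relaxes only that one timing and leaves peak bandwidth alone. which of the two helps stability more is exactly the open question.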
regards, mark hahn.