[Beowulf] Tyan S2882
Eric W. Biederman ebiederm at xmission.com
Thu Sep 28 07:02:07 PDT 2006
Gebhardt Thomas <gebhardt at hrz.uni-marburg.de> writes:

> Hi,
>
>> We are currently deploying Tyan S2882 Dual Opteron Boards, and we have
>> found the system to be quite unstable. After BIOS updates and kernel
>> changes we still get random kernel panics when under load.
>
> Me too :-(
>
> We've got a 85 Node Dual Opteron Cluster. I've documented most of the
> crashes on
> http://clust-doc.hrz.uni-marburg.de/index.php/Hardware_Bulletin .
>
> Our equipment:
>
> * Dual AMD Opteron DP270 (2.0 GHz)
> * MB: TYAN S2882G3-DNR
> * Mem: 8*1GB PC3200 (DDR 400) ECC reg.; Corsair/Samsung CM72SD1024RLP-3200/SB
>   ( 12 nodes have 8*2GB)
> * PS: EMACS P1 6400P
> * HD: 250 GB SATA from Western Digital
>
> Dist: Debian/Sarge amd64
> Kernel: various, currently 2.6.15.3 from kernel.org
> BIOS: (most recent, as far as I know)
>
> When a node crashes, we typically see a MCE + kernel panic. We get about
> 2 crashes per week on our 85 node cluster. Some nodes seem to be more unstable
> than others but we also see instabilities on nodes that had been stable so
> far. The instabilities are very hard to reproduce: we have nodes that crashed
> once and ran stable afterwards. Crashes seem to occur mostly when the system
> is under heavy CPU (memory?) load.

I bet if you decode the MCE it will say uncorrectable ECC memory error.

> Far too many correctable ECC errors are reported (on a subset of about 10-20
> nodes). Sometimes the ECC errors disappeared after I cyclically interchanged
> the memory modules within one node. There seems to be a weak correlation
> between the instabilities and the tendency to exhibit ECC errors. memtest86
> runs fine on the memory modules.

memtest86 doesn't see correctable memory errors.

> It seems that the last BIOS upgrade has reduced the ECC error rate
> somewhat.
>
> We definitely have no temperature problem. As far as I can see (libsensor)
> the voltages are ok, too.

It sounds like you have a pile of correctable (soft?)
memory errors that occasionally become uncorrectable.

Good Luck,
Eric
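
Since memtest86 does not show correctable errors, one way to watch for them on a running node is to read the per-controller counters that the Linux EDAC subsystem exposes in sysfs. The following is only a rough sketch, not something from the thread: it assumes a kernel with an EDAC driver loaded for the memory controller, and the usual `ce_count` / `ue_count` files under `/sys/devices/system/edac/mc/mcN/`. The function name `read_edac_counts` is made up for illustration.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (correctable, uncorrectable)} from EDAC sysfs.

    Assumption: each memory controller appears as a directory mc0, mc1, ...
    containing plain-text integer files ce_count and ue_count.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable.
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    # Prints nothing if no EDAC controllers are present on this machine.
    for name, (ce, ue) in read_edac_counts().items():
        print(f"{name}: correctable={ce} uncorrectable={ue}")
```

Collecting these counts periodically from every node would show which nodes accumulate correctable errors, which could then be compared against the nodes that later panic with an MCE.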
