[Beowulf] Multisocket mainboard hardware problems

Francesco Pietra francesco.pietra at accademialucchese.it
Tue Jan 13 01:26:10 PST 2009


Hi:

I am posting here following a suggestion on the Debian amd64 site. My
original inquiry to the mainboard manufacturer/vendor in Europe
resulted only in unhelpful suggestions, and they have since stopped
answering.

My question is directed to users familiar with multisocket NUMA-type
mainboards based on dual-core AMD Opteron 875 CPUs. Mine is a
Supermicro H8QC8 with the nVidia CK804 and AMD 8132 chipsets, running
Debian Linux amd64 lenny.

One of the CPUs has suddenly lost access to its 4-slot memory bank
(the machine was shut down in an orderly way; the problem appeared at
the next boot of Linux). Still, the CPU cores are OK, the
HyperTransport links are fully working, and both Amber 10 and NWChem
5.1 still run in parallel on all cores, but one of the CPUs must be
slower, having to borrow memory from the other banks. The hardware
status, after a period of complete darkness, is described in the
attached lshw_deb64_7Jan2009.txt.

As each bank of Kingston DDR1 is filled with 2+2+1+1 GB, I identified
the faulty bank, removed all modules from it, and replaced the 1+1 GB
modules in another bank with the 2+2 GB modules from the faulty bank,
so that the computer is now at 20 GB. The situation is described in
the attached lshw_deb64_lessCPU2_scrambling1G_2G_CPU4_7Jan2009.txt.
Actually, the identification of the CPU (CPU2) related to the faulty
memory bank is uncertain: I simply took the CPU nearest to the faulty
bank. The manual is not helpful in this regard.

I understand that, in order to rule out causes other than the
mainboard, I should be certain that the CPU has not lost its memory
controller. However, replacing a CPU (I have one spare second-hand
CPU) or swapping the CPUs around is quite troublesome, and risky, in
my setup: there is very little space around the mainboard in the rack
that I built to house it. Ventilation is excellent, however.

Therefore, is there any software way to check whether the CPUs are
fully in order, including the memory controller? lshw and other
software have provided only partial help in my hands.
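
To make the kind of check I mean concrete, here is a minimal sketch,
assuming the kernel was built with NUMA support and exposes per-node
memory under /sys/devices/system/node. The function names are only
illustrative, and which node number corresponds to which physical
socket is not something the script can tell. It reports only how much
memory each node sees, so by itself it cannot distinguish a failed
memory controller from failed modules or slots.

```python
import glob
import re


def parse_node_meminfo(text):
    """Parse the text of one node's meminfo file into {field: kB}."""
    fields = {}
    for line in text.splitlines():
        # Expected line shape: "Node 0 MemTotal:       6291456 kB"
        m = re.match(r"Node\s+\d+\s+(\w+):\s+(\d+)\s*kB", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def memory_per_node(pattern="/sys/devices/system/node/node*/meminfo"):
    """Return {node number: MemTotal in kB} for every node found."""
    totals = {}
    for path in sorted(glob.glob(pattern)):
        node = int(re.search(r"node(\d+)", path).group(1))
        with open(path) as fh:
            totals[node] = parse_node_meminfo(fh.read()).get("MemTotal", 0)
    return totals


if __name__ == "__main__":
    for node, kb in sorted(memory_per_node().items()):
        flag = "  <-- no local memory" if kb == 0 else ""
        print(f"node {node}: {kb / 1024 / 1024:.1f} GB{flag}")
```

A node listed with zero memory would correspond to the CPU that has
to borrow memory from the other banks.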

Any other suggestion would also be greatly appreciated.

Thanks for your kind attention

francesco pietra


