[Beowulf] Multisocket mainboard hardware problems

Jon Aquilina eagles051387 at gmail.com
Thu Jan 15 13:21:27 PST 2009


Try running Memtest86+. It is a CD that you boot from, and it tests the
memory. Leave it running for a few hours to find out whether the fault is in
the RAM or in the sockets. I am not sure how to test the CPU.
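
From within a running Linux system, one rough way to see which CPU cores have lost their local memory is to read the per-node files the kernel exposes under /sys/devices/system/node. The sketch below is only an illustration, not a test of the memory controller itself: it assumes the kernel lists each Opteron socket as its own NUMA node with `cpulist` and `meminfo` files, which may not hold if the kernel folds a memoryless socket into a neighbouring node. The function names are made up for this example.

```python
import os
import re


def parse_mem_total_kb(meminfo_text):
    """Return MemTotal in kB from a per-node meminfo file, or None if absent."""
    match = re.search(r"MemTotal:\s+(\d+)\s*kB", meminfo_text)
    return int(match.group(1)) if match else None


def read_numa_nodes(root="/sys/devices/system/node"):
    """Map node id -> (cpulist string, MemTotal in kB) for each node under root."""
    nodes = {}
    for entry in os.listdir(root):
        m = re.fullmatch(r"node(\d+)", entry)
        if not m:
            continue  # skip files such as "online" or "possible"
        node_dir = os.path.join(root, entry)
        with open(os.path.join(node_dir, "cpulist")) as f:
            cpus = f.read().strip()
        with open(os.path.join(node_dir, "meminfo")) as f:
            mem_kb = parse_mem_total_kb(f.read())
        nodes[int(m.group(1))] = (cpus, mem_kb)
    return nodes


def nodes_without_memory(nodes):
    """Return the sorted ids of nodes whose MemTotal is zero or missing."""
    return [n for n, (_, mem_kb) in sorted(nodes.items()) if not mem_kb]


if __name__ == "__main__":
    root = "/sys/devices/system/node"
    if os.path.isdir(root):
        nodes = read_numa_nodes(root)
        for node_id, (cpus, mem_kb) in sorted(nodes.items()):
            print(f"node{node_id}: cpus {cpus}, MemTotal {mem_kb} kB")
        print("nodes without memory:", nodes_without_memory(nodes))
    else:
        print("no NUMA node information found under", root)
```

If a node is reported with cores but no memory, those core numbers point to the socket whose bank has disappeared, which may help identify the CPU before anything is moved on the board.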

On Tue, Jan 13, 2009 at 10:26 AM, Francesco Pietra <
francesco.pietra at accademialucchese.it> wrote:

> Hi:
>
> I am posting here from a suggestion on the Debian amd64 site. My
> original posting to the mainboard factory/vendor in Europe only
> resulted in uninteresting suggestions, and they did not answer any
> more.
>
> My question is directed to users familiar with multisocket NUMA-type
> mainboards based on dual-core AMD Opteron 875 CPUs. My own is a
> Supermicro H8QC8 with the nVidia CK804 and AMD 8132 chipset, running
> Debian Linux amd64 lenny.
>
> One of the CPUs has suddenly lost visibility of its 4-slot memory
> bank (I shut the machine down in an orderly way; the problem arose on
> the next boot of Linux). Still, the CPU cores are OK, the HyperTransport
> links are fully working, and both Amber 10 and NWChem 5.1 still run
> fully in parallel, but one of the CPUs must be slower, having to borrow
> memory from the other banks. The hardware status, after a period of
> complete darkness, is described in the attached
> lshw_deb64_7Jan2009.txt.
>
> As each bank is filled with 2+2+1+1 GB of Kingston DDR1, I identified
> the faulty bank, removed all modules from it, and replaced the 1+1 GB
> modules in another bank with the 2+2 GB modules from the faulty bank,
> so that the computer is now at 20 GB. The situation is described in
> the attached lshw_deb64_lessCPU2_scrambling1G_2G_CPU4_7Jan2009.txt.
> Actually, the identification of the CPU (CPU2) related to the faulty
> memory bank is uncertain: I just took the CPU nearest to the faulty
> bank. The manual is not helpful in this regard.
>
> I understand that, in order to rule out non-mainboard causes, I should
> be certain that a CPU has not lost its memory controller. But
> replacing (I have one spare second-hand CPU) or swapping the CPUs is
> quite troublesome and risky in my context (there is very little space
> around the mainboard in the rack that I engineered to accept the
> mainboard). Ventilation is excellent, however.
>
> Therefore, is there any way to check in software whether the CPUs are
> fully in order, including the memory controller? lshw and other
> software were of only partial help in my hands.
>
> Any other suggestion would also be greatly appreciated.
>
> Thanks for your kind attention
>
> francesco pietra
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>



-- 
Jonathan Aquilina