[Beowulf] Multisocket mainboard hardware problems

Jon Aquilina eagles051387 at gmail.com
Fri Jan 16 13:47:50 PST 2009


I added the mailing list to this since you did not hit reply-to-all and I
have been the only one getting the replies. I think that is not fair, and you
should be allowed to contact the manufacturer directly. I did that with
Corsair because of some faulty RAM, and I am RMA'ing the paired set that I
have back to them. In all honesty, I would contact the manufacturer and bypass
the vendor altogether.

On Fri, Jan 16, 2009 at 10:11 PM, Francesco Pietra <chiendarret at gmail.com> wrote:

> To conclude, as it will be uninteresting to subscribers from here on,
> in Europe the customer can only contact the vendor of the Supermicro
> product. That gave no useful hint and the vendor does not answer any
> more. I asked which kind of test he wants to have in order to accept
> the mainboard for repair and he did not answer. Therefore, it could be
> a waste of time replacing the CPU (I have a spare one) unless the CPU
> alone is faulty, which (I believe) is unlikely. If I prove that
> the CPU was not faulty, I could inform Beowulf and some friends around
> here about that discovery, or start an international legal action.
> Therefore, unless the CPU can be fully tested by software (and, if
> faulty, be replaced), I will do nothing other than look for another
> mainboard and assemble a new machine, this time for 16 logical
> processors. The more I have, the faster the work goes. I understand that
> suggestions about the brand (obviously Supermicro is ruled out) can't
> be expected here.
> Thanks for all
> francesco
>
> On Fri, Jan 16, 2009 at 8:10 PM, Jon Aquilina <eagles051387 at gmail.com>
> wrote:
> > In that case you need to contact them by phone and request an RMA.
> >
> > On Fri, Jan 16, 2009 at 3:48 PM, Francesco Pietra <chiendarret at gmail.com> wrote:
> >>
> >> I already tried that. The modules from the bad bank work in another
> >> motherboard. Conversely, good modules from another mainboard do not work
> >> in the bad bank.
> >>
> >> I am no system expert, just a chemist, but I can only figure that the
> >> memory controller of the CPU is damaged. Otherwise the fault has
> >> arisen in the motherboard (voltage controller or something else).
> >>
> >> francesco
> >>
> >> On Fri, Jan 16, 2009 at 10:10 AM, Jon Aquilina <eagles051387 at gmail.com>
> >> wrote:
> >> > Dunno about another type of motherboard, but do you have another stick
> >> > of RAM you can try in those sockets instead? If so, it could be that
> >> > you just have bad RAM.
> >> >
> >> > On Fri, Jan 16, 2009 at 9:46 AM, Francesco Pietra <chiendarret at gmail.com> wrote:
> >> >>
> >> >> Hi:
> >> >> Running memtest86+ v. 2.11 is the first test I carried out, repeatedly
> >> >> and until completion. It did not detect the modules at the faulty bank
> >> >> and did not show errors for the remaining RAM (18GB). Otherwise, the
> >> >> 6GB from the faulty bank are OK. I would like to test via software the
> >> >> memory controller of the CPU at the faulty bank, which I believe is
> >> >> the last chance that the mainboard is not damaged. All CPUs have
> >> >> working HyperTransport links, and I have replaced two 1GB modules with
> >> >> 2GB modules. Still, 20GB falls short for some of my calculations.
> >> >>
> >> >> As the Supermicro mainboard is only 8 months old (during which period
> >> >> it managed all 24GB RAM), I expected Supermicro Europe to take
> >> >> action in some way. They simply stopped answering after having
> >> >> suggested something totally uninteresting.
> >> >>
> >> >> Therefore, in assembling a new 4 quad-core UMA system, I am looking
> >> >> for another brand of mainboards. Suggestions?
> >> >>
> >> >> francesco
> >> >>
> >> >> On Thu, Jan 15, 2009 at 10:21 PM, Jon Aquilina <eagles051387 at gmail.com> wrote:
> >> >> > Try running memtest86+. It's a CD that you boot from that tests the
> >> >> > memory. Leave it running for a few hours to make sure whether it is
> >> >> > the RAM or the sockets. I am not sure how to test the CPU.
> >> >> >
> >> >> > On Tue, Jan 13, 2009 at 10:26 AM, Francesco Pietra
> >> >> > <francesco.pietra at accademialucchese.it> wrote:
> >> >> >>
> >> >> >> Hi:
> >> >> >>
> >> >> >> I am posting here from a suggestion on the Debian amd64 site. My
> >> >> >> original posting to the mainboard factory/vendor in Europe only
> >> >> >> resulted in uninteresting suggestions, and they did not answer any
> >> >> >> more.
> >> >> >>
> >> >> >> My question is directed to users familiar with multisocket
> >> >> >> UMA-type mainboards based on dual-core AMD Opteron 875 CPUs. My
> >> >> >> own is a Supermicro H8QC8 with the nVidia CK804 and AMD 8132
> >> >> >> chipset, running Debian Linux amd64 lenny.
> >> >> >>
> >> >> >> One of the CPUs has suddenly lost access to its 4-slot memory
> >> >> >> bank (I shut down the machine in an orderly way; the problem
> >> >> >> arose on the next Linux boot). Still, the CPU cores are OK, the
> >> >> >> HyperTransport links are fully working, and parallel runs of both
> >> >> >> Amber 10 and NWChem 5.1 work fully, but one of the CPUs must be
> >> >> >> slower, having to borrow memory from the other banks. The
> >> >> >> hardware status, after a period of complete darkness, is
> >> >> >> described in the attached lshw_deb64_7Jan2009.txt.
> >> >> >>
> >> >> >> As each bank is filled with Kingston DDR1 as 2+2+1+1 GB, I
> >> >> >> identified the faulty bank, removed all modules from there, and
> >> >> >> replaced the 1+1 GB modules at another bank with 2+2 GB from the
> >> >> >> faulty bank, so that now the computer is at 20GB. The situation is
> >> >> >> described in the attached
> >> >> >> lshw_deb64_lessCPU2_scrambling1G_2G_CPU4_7Jan2009.txt. Actually,
> >> >> >> identification of the CPU (CPU2) related to the faulty mem bank is
> >> >> >> uncertain: I just considered the nearest CPU to the faulty bank.
> >> >> >> The manual is not helpful in this regard.
> >> >> >>
> >> >> >> I understand that, in order to rule out non-mainboard causes, I
> >> >> >> should be certain that a CPU has not lost memory control. However,
> >> >> >> replacing (I have one spare second-hand CPU) or swapping the CPUs
> >> >> >> is quite troublesome, and risky, in my context (there is very
> >> >> >> little space around the mainboard in the rack that I engineered to
> >> >> >> accept the mainboard). Ventilation is excellent, however.
> >> >> >>
> >> >> >> Therefore, is there any software way to check whether the CPUs
> >> >> >> are fully in order, including the memory controller? lshw and
> >> >> >> other software provided only partial help in my hands.
> >> >> >>
> >> >> >> Also any other suggestion would be greatly appreciated.
> >> >> >>
> >> >> >> Thanks for your kind attention
> >> >> >>
> >> >> >> francesco pietra
> >> >> >> _______________________________________________
> >> >> >> Beowulf mailing list, Beowulf at beowulf.org
> >> >> >> To change your subscription (digest mode or unsubscribe) visit
> >> >> >> http://www.beowulf.org/mailman/listinfo/beowulf
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Jonathan Aquilina
> >> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Jonathan Aquilina
> >> >
> >
> >
> >
> > --
> > Jonathan Aquilina
> >
>



-- 
Jonathan Aquilina