[Beowulf] Multisocket mainboard hardware problems

Jon Aquilina eagles051387 at gmail.com
Fri Jan 16 13:47:50 PST 2009


I added the mailing list to this, since you did not hit reply-all and I
have been the only one getting the replies. I think that is not fair, and you
should be allowed to contact the manufacturer directly. I did that with
Corsair because of some faulty RAM, and I am RMA'ing the paired set that I
have back to them. In all honesty, I would contact the manufacturer and
bypass the vendor altogether.

On Fri, Jan 16, 2009 at 10:11 PM, Francesco Pietra <chiendarret at gmail.com> wrote:

> To conclude, as it will be uninteresting to subscribers from here on:
> in Europe the customer can only contact the vendor of the Supermicro
> product. That gave no useful hint, and the vendor no longer answers.
> I asked which kind of test he wants in order to accept
> the mainboard for repair, and he did not answer. Therefore, it could be
> a waste of time replacing the CPU (I have a spare one) unless it is
> just the CPU that is faulty, which (I believe) is unlikely. If I prove
> that the CPU was not faulty, I could inform Beowulf and some friends
> around here about that discovery, or start an international legal action.
> Therefore, unless the CPU can be fully tested by software (and, if
> faulty, be replaced), I will do nothing other than look for another
> mainboard and assemble a new machine, this time for 16 logical
> processors. The more I have, the faster the work. I understand that
> suggestions about the brand (obviously Supermicro is ruled out) can't
> be expected here.
> Thanks for all
> francesco
>
> On Fri, Jan 16, 2009 at 8:10 PM, Jon Aquilina <eagles051387 at gmail.com>
> wrote:
> > In that case you need to contact them by phone and request an RMA.
> >
> > On Fri, Jan 16, 2009 at 3:48 PM, Francesco Pietra <chiendarret at gmail.com>
> > wrote:
> >>
> >> I already tried that. The modules from the bad bank are OK on another
> >> motherboard. Vice versa, good modules from another mainboard do not work
> >> on the bad bank.
> >>
> >> I am no system expert, just a chemist, but I can only figure that the
> >> memory controller of the CPU is damaged. Otherwise the fault has
> >> arisen in the motherboard (voltage controller or something else).
> >>
> >> francesco
> >>
> >> On Fri, Jan 16, 2009 at 10:10 AM, Jon Aquilina <eagles051387 at gmail.com>
> >> wrote:
> >> > I don't know about another type of motherboard, but do you have
> >> > another stick of RAM you can try in those sockets instead? If so,
> >> > it could be that you just have bad RAM.
> >> >
> >> > On Fri, Jan 16, 2009 at 9:46 AM, Francesco Pietra
> >> > <chiendarret at gmail.com>
> >> > wrote:
> >> >>
> >> >> Hi:
> >> >> Running memtest86+ v. 2.11 was the first test I carried out,
> >> >> repeatedly and to completion. It did not detect the slots at the
> >> >> faulty bank and did not show errors for the remaining RAM (18GB).
> >> >> Otherwise, the 6GB from the faulty bank are themselves OK. I would
> >> >> like to test via software the memory controller of the CPU at the
> >> >> faulty bank, which I believe is the last chance for the mainboard
> >> >> not being damaged. All CPUs have correct HyperTransport, and I have
> >> >> replaced two 1GB modules with 2GB modules. Still, the 20GB fall
> >> >> short for some of my calculations.
> >> >>
> >> >> As the Supermicro mainboard is only 8 months old (during which period
> >> >> it managed all 24GB RAM), I expected Supermicro Europe to take
> >> >> action in some way. They simply stopped answering after having
> >> >> suggested something totally uninteresting.
> >> >>
> >> >> Therefore, in assembling a new 4 quad-core UMA system, I am looking
> >> >> for another brand of mainboards. Suggestions?
> >> >>
> >> >> francesco
> >> >>
> >> >> On Thu, Jan 15, 2009 at 10:21 PM, Jon Aquilina <eagles051387 at gmail.com>
> >> >> wrote:
> >> >> > Try running memtest86+. It is a CD that you boot from, and it tests
> >> >> > the memory. Leave it running for a few hours to make sure whether
> >> >> > it is the RAM or the sockets. I am not sure how to test the CPU.
> >> >> >
> >> >> > On Tue, Jan 13, 2009 at 10:26 AM, Francesco Pietra
> >> >> > <francesco.pietra at accademialucchese.it> wrote:
> >> >> >>
> >> >> >> Hi:
> >> >> >>
> >> >> >> I am posting here from a suggestion on the Debian amd64 site. My
> >> >> >> original posting to the mainboard factory/vendor in Europe only
> >> >> >> resulted in uninteresting suggestions, and they did not answer any
> >> >> >> more.
> >> >> >>
> >> >> >> My question is directed to users familiar with multisocket
> >> >> >> UMA-type mainboards based on the dual-core AMD Opteron 875 CPU.
> >> >> >> My own is a Supermicro H8QC8 with chipset nVidia CK804 and
> >> >> >> AMD 8132, driven by Debian Linux amd64 lenny.
> >> >> >>
> >> >> >> One of the CPUs has suddenly lost visibility of its
> >> >> >> 4-slot memory bank (I shut down the machine in an orderly way;
> >> >> >> the problem arose the next time Linux was loaded). Still, the CPU
> >> >> >> cores are OK, the HyperTransport links are fully working, and
> >> >> >> parallelization for both Amber 10 and NWChem 5.1 is fully
> >> >> >> provided, but one of the CPUs must be slower, having to borrow
> >> >> >> memory from the other banks. The hardware status, after a period
> >> >> >> of complete darkness, is described in the attached
> >> >> >> lshw_deb64_7Jan2009.txt.
> >> >> >>
> >> >> >> As each bank of Kingston DDR1 is filled with 2+2+1+1 GB, I
> >> >> >> identified the faulty bank, removed all modules from it, and
> >> >> >> replaced the 1+1 GB modules at another bank with the 2+2 GB from
> >> >> >> the faulty bank, so that now the computer is at 20GB. The
> >> >> >> situation is described in the attached
> >> >> >> lshw_deb64_lessCPU2_scrambling1G_2G_CPU4_7Jan2009.txt. Actually,
> >> >> >> the identification of the CPU (CPU2) related to the faulty memory
> >> >> >> bank is uncertain: I just took the CPU nearest to the faulty
> >> >> >> bank. The manual is not helpful in this regard.
> >> >> >>
> >> >> >> I understand that, in order to rule out non-mainboard causes, I
> >> >> >> should be certain that a CPU has not lost memory control. However,
> >> >> >> replacing a CPU (I have one spare second-hand CPU) or swapping the
> >> >> >> CPUs around is quite troublesome, and risky, in my context (there
> >> >> >> is very little space around the mainboard in the rack that I
> >> >> >> engineered to accept the mainboard). Ventilation is excellent,
> >> >> >> however.
> >> >> >>
> >> >> >> Therefore, is there any software way to check whether the CPUs
> >> >> >> are fully in order, including the memory controller? lshw and
> >> >> >> other software provided only partial help in my hands.
> >> >> >>
> >> >> >> Also any other suggestion would be greatly appreciated.
> >> >> >>
> >> >> >> Thanks for your kind attention
> >> >> >>
> >> >> >> francesco pietra
> >> >> >> _______________________________________________
> >> >> >> Beowulf mailing list, Beowulf at beowulf.org
> >> >> >> To change your subscription (digest mode or unsubscribe) visit
> >> >> >> http://www.beowulf.org/mailman/listinfo/beowulf
> >> >> >
> >> >> >
> >> >> >
> >> >> > --
> >> >> > Jonathan Aquilina
> >> >> >
> >> >
> >> >
> >> >
> >> > --
> >> > Jonathan Aquilina
> >> >
> >
> >
> >
> > --
> > Jonathan Aquilina
> >
>



-- 
Jonathan Aquilina