Riser card -mainboard conflicts?

Jeff Nguyen jeff at aslab.com
Wed Jan 8 10:17:40 PST 2003


Hi Donald,

Did you receive a recent Email that I sent regarding the request for
quotation? Please let me know if that message got through or not.

Jeff

----- Original Message -----
From: "Donald Becker" <becker at scyld.com>
To: <tegner at nada.kth.se>
Cc: <beowulf at beowulf.org>
Sent: Wednesday, January 08, 2003 8:19 AM
Subject: Re: Riser card -mainboard conflicts?


> On Wed, 8 Jan 2003 tegner at nada.kth.se wrote:
>
> > We have a cluster consisting of 30 athlon 2000+ nodes on a KT3 Ultra
> > MS-6380E mainboard (using ide discs) connected by a fast Ethernet
> > network.
> >
> > For the nodes we use 2U chassis, and the NIC and the graphic card sit
> > on a PCI-301 riser card.
> ..
> > On one of the nodes we can never get the network to function, there
> > are messages about bus-master dirty, PCI bus error, etc, and we never
> > get any contact with the rest of the cluster.
>
> PCI bus errors are a pretty clear indication that the riser cards are a
> problem.
>
> > The other nodes "seem" to work OK, but for some parallel applications
> > one or more of the nodes just "give up" after some time, and in those
> > cases we get similar messages as above - but it has also happened
> > that a node just died in which case we have to use the reset button to
> > get it back.
> ...
> > We start to suspect that mainboard and the riser card are in some way
> > incompatible, but we would greatly appreciate any hints of other
> > reasons for these problems.
>
> OK, here is an alternative: you have _both_ memory errors and PCI errors.
> Track down the PCI errors first.
>
> Not all drivers report PCI bus errors.  Especially with vendor-written
> drivers, there is a reason to ignore or silently recover from errors --
> the driver and hardware _appear_ more robust when there are no messages.
> The scary thing is that you might have silent data corruption from other
> devices.  Any driver that goes to the extra effort of reporting a bus
> error is doing you a big favor by pointing out the problem!
>
> --
> Donald Becker becker at scyld.com
> Scyld Computing Corporation http://www.scyld.com
> 410 Severn Ave. Suite 210 Scyld Beowulf cluster system
> Annapolis MD 21403 410-990-9993
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>



