Riser card - mainboard conflicts?

Donald Becker becker at scyld.com
Wed Jan 8 08:19:45 PST 2003


On Wed, 8 Jan 2003 tegner at nada.kth.se wrote:

> We have a cluster consisting of 30 Athlon 2000+ nodes on a KT3 Ultra
> MS-6380E mainboard (using IDE disks) connected by a Fast Ethernet
> network.
> 
> For the nodes we use 2U chassis, and the NIC and the graphics card sit
> on a PCI-301 riser card.
..
> On one of the nodes we can never get the network to function; there
> are messages about bus-master dirty, PCI bus error, etc., and we never
> get any contact with the rest of the cluster.

PCI bus errors are a pretty clear indication that the riser cards are a
problem.

> The other nodes "seem" to work OK, but for some parallel applications
> one or more of the nodes just "give up" after some time, and in those
> cases we get messages similar to those above. It has also happened
> that a node just died, in which case we have to use the reset button
> to get it back.
...
> We are starting to suspect that the mainboard and the riser card are
> in some way incompatible, but we would greatly appreciate any hints
> about other reasons for these problems.

OK, here is an alternative: you have _both_ memory errors and PCI errors.
Track down the PCI errors first.

Not all drivers report PCI bus errors.  Especially with vendor-written
drivers, there is an incentive to ignore or silently recover from errors:
the driver and hardware _appear_ more robust when there are no messages.
The scary thing is that you might have silent data corruption from other
devices.  Any driver that goes to the extra effort of reporting a bus
error is doing you a big favor by pointing out the problem!
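
Because a quiet driver proves nothing, one way to look for latched bus
errors independently of any driver is to read each device's PCI status
register (the 16-bit word at offset 0x06 of configuration space) and decode
its error bits. The following is a minimal sketch, not a diagnostic tool:
it assumes a Linux system that exposes configuration space under
/sys/bus/pci/devices, and the function names are illustrative only. A set
bit is a lead rather than proof, since master aborts in particular can be
latched by ordinary bus probing.

```python
import struct
from pathlib import Path

# Error bits in the 16-bit PCI status register (config space offset 0x06).
STATUS_ERROR_BITS = {
    15: "detected parity error",
    14: "signaled system error",
    13: "received master abort",
    12: "received target abort",
    11: "signaled target abort",
    8: "master data parity error",
}


def decode_pci_status(status):
    """Return the names of the error bits set in a PCI status word."""
    return [
        name
        for bit, name in sorted(STATUS_ERROR_BITS.items(), reverse=True)
        if status & (1 << bit)
    ]


def status_from_config(config):
    """Extract the little-endian status word from raw config-space bytes."""
    (status,) = struct.unpack_from("<H", config, 0x06)
    return status


def scan(sysfs_root="/sys/bus/pci/devices"):
    """Map device address -> latched errors, for devices that have any.

    The sysfs path is an assumption; an empty dict is returned if it
    does not exist on this system.
    """
    report = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return report
    for dev in sorted(root.iterdir()):
        try:
            config = (dev / "config").read_bytes()
        except OSError:
            continue  # unreadable device entry: skip it
        if len(config) < 8:
            continue  # too short to contain the status register
        errors = decode_pci_status(status_from_config(config))
        if errors:
            report[dev.name] = errors
    return report


if __name__ == "__main__":
    for address, errors in scan().items():
        print(address, ", ".join(errors))
```

Running this on every node and comparing the output between the failing
node and the apparently healthy ones would show whether devices other than
the NIC are also seeing errors behind the riser.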

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Scyld Beowulf cluster system
Annapolis MD 21403			410-990-9993



