Dual-Athlon Cluster Problems

Robert G. Brown rgb at phy.duke.edu
Mon Jan 27 05:46:45 PST 2003


On Sun, 26 Jan 2003, Ben Ransom wrote:

> It is still curious to me that we can run other codes on Dolphin SCI,
> showing 100% CPU utilization (full power/heat), on a ring away from a
> suspect SCI card.  This implies the reliability is code dependent, as others
> have alluded.  I suppose this may be due to the amount of message
> passing?  Hopefully the problem will disappear once we get our full Dolphin
> set in working order.

Our code-dependent reliability problems don't involve either Dolphin or
MPI, but I think they may be real.  The problem is reproducibility and
extensibility.  If the problem is really a memory leak or an instruction
bug in the Fortran compiler (absent in e.g. gcc), then one would have to
be able to test lots of compilers on lots of systems on the same code,
and until one had GREAT statistics and reproducibility one would still
be left wondering whether it were really a hardware problem associated
with bad cooling fans, bad memory DIMMs, etc. on just some of the nodes.
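
(To be concrete about what "testing the same code" would mean: something
like the hypothetical little probe below, which is nobody's production
code, just a sketch.  Compile the identical source once with each
compiler you suspect, then run every resulting binary on every node.  A
checksum that changes from one iteration to the next on a given node is
a data point; trouble that follows one particular binary around the
cluster implicates that compiler, and trouble that stays with one
particular node no matter which binary it runs implicates that node's
hardware.)

/* burn.c - hypothetical reproducibility probe (a sketch, not any of the
 * codes discussed in this thread).  It does the same deterministic
 * floating point work forever and prints a checksum now and then. */
#include <stdio.h>
#include <unistd.h>

#define N 1000000

int main(void)
{
    static double a[N];
    char host[256] = "unknown";
    unsigned long iter = 0;
    long i;

    gethostname(host, sizeof(host));   /* tag output with the node name */
    for (;;) {
        double sum = 0.0;
        /* identical work every pass: for a given binary on a given node
         * the checksum should never change from one iteration to the next */
        for (i = 0; i < N; i++)
            a[i] = (double)i * 1.0000001;
        for (i = 0; i < N; i++)
            sum += a[i];
        iter++;
        if (iter % 100 == 0) {
            printf("%s iter %lu checksum %.17g\n", host, iter, sum);
            fflush(stdout);
        }
    }
    return 0;
}

Run each compiler's binary on every node for a week or two and you start
to have the kind of statistics I mean; anything less and you're still
guessing.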

As it is, I can run my jobs (built with gcc) on both our 2460 and 2466
nodes until the cows come home with only >>very<< rare crashes, probably
hardware-related incidents.  Another group here has maybe 4x my crash
rate running their Fortran-based code.

Another problem is that it is amazingly difficult to isolate pure
programmer error from all this.  The other group is running Fortran
because their code base (used for high-energy nuclear theory, e.g.
lattice gauge computations) has been written by generations of graduate
students in dozens of universities and nobody alive knows or has
properly certified all its subroutines and aspects.  OTOH, I personally
wrote all the code I use (and don't even use a lot of prebuilt numerical
library code, although that is changing with the GSL being fairly good
and reliable).  If my code crashes habitually, I have a prayer of
debugging the crash point.  If one of their routines leaks, or manages
to tweak some deep weakness in the kernel or a device driver through
some (mis)use of a system call, how can they know?
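
(One thing anyone in their position can do without touching a line of
the Fortran is watch the job's memory footprint from the outside.  The
little watcher below is my own hypothetical sketch, assuming a Linux
/proc filesystem; it isn't anything we actually run.  Point it at the
job's PID and let it log alongside the run.  If the resident set of a
fixed-size computation climbs steadily over hours, something is leaking,
whatever compiler built the binary.)

/* rsswatch.c - hypothetical external leak watcher (a sketch, not part
 * of any code discussed in this thread).  Polls /proc/<pid>/statm once
 * a minute and prints the resident set size in kB. */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[64];
    long page_kb;

    if (argc != 2) {
        fprintf(stderr, "usage: %s pid\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/statm", argv[1]);
    page_kb = sysconf(_SC_PAGESIZE) / 1024;   /* pages -> kB */

    for (;;) {
        FILE *fp = fopen(path, "r");
        long size = 0, resident = 0;

        if (fp == NULL) {                     /* the process has exited */
            fprintf(stderr, "process %s is gone\n", argv[1]);
            return 0;
        }
        /* /proc/<pid>/statm: total size, resident set, ... (in pages) */
        if (fscanf(fp, "%ld %ld", &size, &resident) == 2)
            printf("resident: %ld kB\n", resident * page_kb);
        fclose(fp);
        fflush(stdout);
        sleep(60);
    }
}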

   rgb

> 
> BTW, we are using dual 1800+ MP Athlons on the Tyan S2466 motherboard / 760MX
> chipset, Redhat 7.2 with 2.4.18smp, and as suggested above, I think 
> everything is probably fine with this.
> 
> -Ben Ransom
>   UC Davis, Mech Engineering Dept
> 
> At 05:45 PM 1/23/2003 +1100, Chris Steward wrote:
> >Hi,
> >
> >We're in the process of setting up a new 32-node dual-athlon cluster running
> >Redhat 7.3, kernel 2.4.18-19.7.smp. The configuration is attached below. We're
> >having problems with nodes hanging during calculations, sometimes only after
> >several hours of runtime. We have a serial console connected to such nodes but
> >that is unable to interact with the nodes once they hang. Nothing is logged
> >either.  Running jobs on one CPU doesn't seem to present too much
> >of a problem, but when the machines are fully loaded (both CPUs at 100%
> >utilization) errors start to occur and machines die, often up to 8 nodes
> >within 24 hours.  Temperature of nodes under full load is approximately 55C.
> >We have tried using the "noapic" option but the problems still persist.  Using
> >other software not requiring EnFuzion 6 also produces the same problems.
> >
> >We seek feedback on the following:
> >
> >1/ Are there issues using redhat 7.3 as opposed to 7.2 in such
> >    a setup ?
> >
> >2/ Are there known issues with 2.4.18 kernels and AMD chips ?
> >    We suspect the problems are kernel related.
> >
> >3/ Are there any problems with dual-athlon clusters using the
> >    MSI K7D Master L motherboard ?
> >
> >4/ Are there any other outstanding issues with these machines
> >    under constant heavy load ?
> >
> >Any advice/help would be greatly appreciated.
> >
> >Thanks in advance
> >
> >Chris
> >
> >--------------------------------------------------------------
> >Cluster configuration
> >
> >node configuration:
> >
> >CPUs:                    2 x Athlon MP2000+
> >RAM:                     1024 MB Kingston PC2100 DDR
> >Operating system:        Redhat 7.3 (with updates)
> >Kernel:                  2.4.18-19.7.xsmp
> >Motherboard:             MSI K7D Master L (Award BIOS 1.5)
> >Network:                 on-board PCI Ethernet (Intel Corp. 82559ER, rev 09),
> >                         latest Intel drivers, "no sleep" option set
> >
> >head-node:
> >
> >CPU:                     single Athlon MP2000+
> >
> >Dataserver:
> >
> >CPU:                     single Athlon MP2000+
> >Network:                 PCI Gigabit NIC
> >
> >Network Interconnect:
> >
> >Cisco 2950 (one GBIC installed)
> >
> >Software:
> >
> >Cluster management:      EnFuzion 6
> >Computational:           Dock V4.0.1
> >
> >
> >
> >
> 

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
