Dual-Athlon Cluster Problems

Ben Ransom bransom at ucdavis.edu
Sun Jan 26 22:13:09 PST 2003

We have had a 20 node dual athlon cluster up (sort of) since June 2002, and 
have experienced more problems than I anticipated.  I was a newbie at that 
time.  We didn't get down to trying multi-day runs until the cluster was 
5-6 months old, at which time problems began to appear.  At first we wondered if our
problems were because of long runs, or just a coincidence.

I agree, it can be very difficult to pin down the source of troubles.  Our 
first discovery was that we had a batch of bad cpu cooling fans.  These 
were AMD fans and we were told that AMD had used a different fan or bearing 
supplier sometime in early 2002.  Anyway, they shipped us 40 new fans and 
we (ugh) replaced them all.  When opening up all nodes to replace fans, I 
would estimate over half were showing some degree of trouble (vibration, 
speed, etc).  After this episode, we worked up to confidence in cooling and 
power supply by running code with ethernet MPI on all nodes (different 100% 
cpu runs on groups of 4 nodes) for at least 48 hours.
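In rough form, the shakeout we settled on looks like this (a hypothetical 
Python sketch, not our actual job scripts; the node names, machines.groupN 
files, and burn_mpi binary are placeholders for whatever full-load MPI code 
you have handy):

```python
# Sketch of the 48-hour shakeout: split 20 dual-CPU nodes into groups of 4
# and emit one mpirun command per group.  All names here are placeholders.
nodes = ["node%02d" % i for i in range(1, 21)]
groups = [nodes[i:i + 4] for i in range(0, len(nodes), 4)]
for n, group in enumerate(groups, start=1):
    # one machinefile per 4-node group
    with open("machines.group%d" % n, "w") as f:
        f.write("\n".join(group) + "\n")
    # dual Athlons -> 2 MPI processes per node, so -np 8 per group
    print("mpirun -np 8 -machinefile machines.group%d ./burn_mpi &" % n)
```

Run the printed commands, let them cook for at least 48 hours, and watch for 
nodes that drop out; a node that survives tells you a lot about its fans and 
power supply.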

As all this was settling out, we were still having trouble getting some of 
our code to run on more than 4 nodes with Dolphin SCI MPI.  We had a bad 
dolphin card, and were delayed in getting that figured out and replaced 
over Xmas break.  Unfortunately, that hasn't been the end of it.  We are 
still unable to run our primary code on the Dolphin, and yet somewhat 
confident in the rest of the cluster from the successful ethernet runs.

One of the reasons we chose Dolphin over Myrinet was thinking that we'd 
avoid the single point of failure in a Myrinet switch.  This was bad 
judgement, as we now know that a bad card (or cable) in our Dolphin setup 
not only crashes a random node's kernel during a run with high message 
passing, but as well, it seemingly prevents us from even launching that 
code on other isolated dolphin rings, i.e. rings that don't include a node 
with suspect SCI card.  Isolating which card or cable is bad requires time, 
and experience ...which I am gaining ;/
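For what it's worth, the card/cable hunt amounts to a bisection over the 
node set.  Here is a minimal Python sketch of the idea, assuming a single 
faulty card or cable and some pass/fail test you can run on a subset of 
nodes (a short high-message-passing job on an isolated ring, say); the 
is_ok test below is simulated with a known bad node rather than a real MPI 
run:

```python
# Hypothetical bisection sketch: halve the node set, test each half on its
# own ring, and recurse into the half that fails.  Assumes exactly one bad
# card/cable; "is_ok" is a stand-in for your real test.
def find_bad_node(nodes, is_ok):
    """Narrow a single faulty card/cable down by halving the node set."""
    while len(nodes) > 1:
        half = nodes[: len(nodes) // 2]
        # if the first half runs clean, the fault is in the other half
        nodes = nodes[len(nodes) // 2:] if is_ok(half) else half
    return nodes[0]

nodes = ["node%02d" % i for i in range(1, 21)]
# simulated test: any group containing node13 fails
suspect = find_bad_node(nodes, lambda group: "node13" not in group)
print(suspect)  # -> node13
```

About five test runs instead of twenty for our cluster, though in practice 
each "run" costs real time to re-cable the rings.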

It is still curious to me that we can run other codes on Dolphin SCI, 
showing 100% cpu utilization (full power/heat), on a ring away from a 
suspect SCI card.  This implies the reliability is code dependent, as 
others have alluded.  I suppose this may be due to the amount of message 
passing?  Hopefully the problem will disappear once we get our full Dolphin 
set in working order.

BTW, we are using dual 1800+ MP athlons on the Tyan S2466 motherboard (AMD 
760 MPX chipset), Redhat 7.2 with 2.4.18smp, and as suggested above, I think 
everything is probably fine with this.

-Ben Ransom
  UC Davis, Mech Engineering Dept

At 05:45 PM 1/23/2003 +1100, Chris Steward wrote:
>We're in the process of setting up a new 32-node dual-athlon cluster running
>Redhat 7.3, kernel 2.4.18-19.7.xsmp. The configuration is attached below. We're
>having problems with nodes hanging during calculations, sometimes only after
>several hours of runtime. We have a serial console connected to such nodes but
>that is unable to interact with the nodes once they hang. Nothing is logged
>either. Running jobs on one CPU doesn't seem to present much of a
>problem, but when the machines are fully loaded (both CPUs at 100%
>utilization) errors start to occur and machines die - often up to 8 nodes
>within 24 hours. Temperature of nodes under full load is approximately 55C.
>We have tried using the "noapic" option but the problems still persist.  Using
>other software not requiring EnFuzion 6 also produces the same problems.
>We seek feedback on the following:
>1/ Are there issues using redhat 7.3 as opposed to 7.2 in such
>    a setup ?
>2/ Are there known issues with 2.4.18 kernels and AMD chips ?
>    We suspect the problems are kernel related.
>3/ Are there any problems with dual-athlon clusters using the
>    MSI K7D Master L motherboard ?
>4/ Are there any other outstanding issues with these machines
>    under constant heavy load ?
>Any advice/help would be greatly appreciated.
>Thanks in advance
>Cluster configuration
>node configuration:
>CPUs:                    dual Athlon MP2000+
>RAM:                     1024MB Kingston PC2100 DDR
>Operating system:        Redhat 7.3 (with updates)
>Kernel:                  2.4.18-19.7.xsmp
>Motherboard:             MSI K7D Master L (Award Bios 1.5)
>Network:                 on-board PCI Ethernet, Intel Corp. 82559ER (rev 09)
>                         (using latest Intel drivers, "no sleep" option set)
>CPU:                     single Athlon MP2000+
>Network:                 PCI Gigabit NIC
>Network interconnect:    Cisco 2950 (one GBIC installed)
>Cluster management:      EnFuzion 6
>Computational software:  Dock V4.0.1
>Beowulf mailing list, Beowulf at beowulf.org
