Dual-Athlon Cluster Problems
Robert G. Brown rgb at phy.duke.edu
Mon Jan 27 05:46:45 PST 2003
On Sun, 26 Jan 2003, Ben Ransom wrote:

> It is still curious to me that we can run other codes on Dolphin SCI and
> show 100% CPU utilization (full power/heat) on a ring away from a
> suspect SCI card. This implies the reliability is code dependent, as others
> have alluded. I spose this may be due to the amount of message
> passing? Hopefully the problem will disappear once we get our full Dolphin
> set in working order.

Our code dependent reliability problems don't involve either Dolphin or
MPI, but I think they may be real. The problem is reproducibility and
extensibility. If the problem is really a memory leak or instruction bug
in the Fortran compiler (absent in e.g. gcc) then one would have to be
able to test lots of compilers on lots of systems on the same code, and
until one had GREAT statistics and reproducibility one would still be
left wondering if it were really a hardware problem associated with bad
cooling fans, bad memory DIMMs, etc. on just some of the nodes.

As it is, I can run my jobs (in gcc) on both our 2460 and 2466 nodes
until the cows come home with only >>very<< rare crashes, probably
hardware-related incidents. Another group here has maybe 4x my crash
rate running their Fortran-based code.

Another problem is that it is amazingly difficult to isolate pure
programmer error from all this. The other group is running Fortran
because their code base (used to do high energy nuclear theory, e.g.
lattice gauge computations) has been written by generations of graduate
students in dozens of universities and nobody alive knows or has
properly certified all its subroutines and aspects. OTOH, I personally
wrote all the code I use (and don't even use a lot of prebuilt numerical
library code, although that is changing with the GSL being fairly good
and reliable). If my code crashes habitually, I have a prayer of
debugging the crash point. If one of their routines leaks or manages to
tweak some deep weakness in kernel or device enabled by some (mis)use of
a system call, how can they know?

   rgb

>
> BTW, we are using dual 1800+ MP athlons on Tyan S2466 motherboard/ 760MX
> chipset, Redhat 7.2 with 2.4.18smp, and as suggested above, I think
> everything is probably fine with this.
>
> -Ben Ransom
>   UC Davis, Mech Engineering Dept
>
> At 05:45 PM 1/23/2003 +1100, Chris Steward wrote:
> >Hi,
> >
> >We're in the process of setting up a new 32-node dual-athlon cluster running
> >Redhat 7.3, kernel 2.4.18-19.7.smp. The configuration is attached below. We're
> >having problems with nodes hanging during calculations, sometimes only after
> >several hours of runtime. We have a serial console connected to such nodes but
> >that is unable to interact with the nodes once they hang. Nothing is logged
> >either. It seems that running jobs on one CPU doesn't seem to present too much
> >of a problem, but when the machines are fully loaded (both CPU's 100%
> >utilization) errors start to occur and machines die, often up to 8 nodes
> >within 24 hours. Temperature of nodes under full load is approximately 55C.
> >We have tried using the "noapic" option but the problems still persist. Using
> >other software not requiring enfuzion 6 also produces the same problems.
> >
> >We seek feedback on the following:
> >
> >1/ Are there issues using redhat 7.3 as opposed to 7.2 in such
> >    a setup ?
> >
> >2/ Are there known issues with 2.4.18 kernels and AMD chips ?
> >    We suspect the problems are kernel related.
> >
> >3/ Are there any problems with dual-athlon clusters using the
> >    MSI K7D Master L motherboard ?
> >
> >4/ Are there any other outstanding issues with these machines
> >    under constant heavy load ?
> >
> >Any advice/help would be greatly appreciated.
> >
> >Thanks in advance
> >
> >Chris
> >
> >--------------------------------------------------------------
> >Cluster configuration
> >
> >node configuration:
> >
> >CPU's:                  Athlon MP2000+
> >RAM:                    1024Mb Kingston PC2100 DDR
> >Operating system:       Redhat 7.3 (with updates)
> >Kernel:                 2.4.18-19.7.xsmp
> >Motherboard:            MSI K7 Master L motherboard (Award Bios 1.5).
> >Network:                On-board PCI (Ethernet controller: Intel Corp.
> >82559ER (rev 09)). (Using latest Intel drivers, "no sleep" option set)
> >
> >head-node:
> >
> >CPU                     single Athlon MP2000+
> >
> >Dataserver:
> >
> >CPU:                    single Athlon MP2000 &
> >Network:                PCI Gigabit NIC
> >
> >Network Interconnect:
> >
> >cisco 2950 (one GBIC installed)
> >
> >Software:
> >
> >Cluster management      Enfusion 6
> >Computational           Dock V4.0.1
> >
> >_______________________________________________
> >Beowulf mailing list, Beowulf at beowulf.org
> >To change your subscription (digest mode or unsubscribe) visit
> >http://www.beowulf.org/mailman/listinfo/beowulf
>

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
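
The point above about needing "GREAT statistics" before blaming a compiler rather than hardware can be illustrated with a rough sketch. Assuming crashes arrive as a Poisson process, two groups' crash counts can be compared with an exact conditional binomial test. The function names and all counts and node-hours below are hypothetical, chosen only for illustration; none of them come from this thread.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def rate_comparison_p_value(crashes_a, hours_a, crashes_b, hours_b):
    """Two-sided exact p-value for H0: both groups share one crash rate.

    Under H0, and conditional on the total number of crashes, the count in
    group A is Binomial(total, hours_a / (hours_a + hours_b)).
    """
    n = crashes_a + crashes_b
    p = hours_a / (hours_a + hours_b)
    observed = binom_pmf(crashes_a, n, p)
    # Add up every outcome that is no more likely than the observed one.
    total = sum(
        pmf
        for pmf in (binom_pmf(k, n, p) for k in range(n + 1))
        if pmf <= observed * (1 + 1e-9)
    )
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical: a 4x crash ratio built on only five crashes in total.
    print(rate_comparison_p_value(1, 1000, 4, 1000))    # 0.375
    # Hypothetical: the same 4x ratio built on fifty crashes in total.
    print(rate_comparison_p_value(10, 1000, 40, 1000))
```

With only a handful of crashes, a 4x difference in rate is entirely compatible with chance; the same ratio becomes convincing only once many more crashes have been logged over comparable node-hours. Even then the test says nothing about whether the cause is the compiler, the code, or the hardware on particular nodes.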
