Dual-Athlon Cluster Problems
Martin Siegert siegert at sfu.ca
Thu Jan 23 10:52:05 PST 2003
On Thu, Jan 23, 2003 at 05:45:40PM +1100, Chris Steward wrote:
>
> We're in the process of setting up a new 32-node dual-athlon cluster running
> Redhat 7.3, kernel 2.4.18-19.7.smp. The configuration is attached below. We're
> having problems with nodes hanging during calculations, sometimes only after
> several hours of runtime. We have a serial console connected to such nodes but
> that is unable to interact with the nodes once they hang. Nothing is logged
> either. It seems that running jobs on one CPU doesn't seem to present too much
> of a problem, but when the machines are fully loaded (both CPU's 100%
> utilization) errors start to occur and machines die often up to 8 nodes
> within 24 hours. Temperature of nodes under full load is approximately 55C.
> We have tried using the "noapic" option but the problems still persist. Using
> other software not requiring enfuzion 6 also produces the same problems.

I run a cluster of 96 dual AMD nodes (Tyan 2462 mb).

> We seek feedback on the following:
>
> 1/ Are there issues using redhat 7.3 as opposed to 7.2 in such
>    a setup ?

No.

> 2/ Are there known issues with 2.4.18 kernels and AMD chips ?
>    We suspect the problems are kernel related.

ACPI does not work and you may need a newer version of i2c/lm_sensors.
Neither issue can account for the problems you are seeing (I haven't
checked RedHat's 2.4.18-19.7.smp kernel, but older versions were compiled
with ACPI disabled, for what are, as far as I can see, good reasons).

> 3/ Are there any problems with dual-athlon clusters using the
>    MSI K7D Master L motherboard ?

Don't know.

> 4/ Are there any other outstanding issues with these machines
>    under constant heavy load ?

In 99% of all the crashes I have seen on my cluster (and I have seen
a lot) the reason was bad memory. If you did not buy memory certified
by the company that sold you the motherboard, exchange it and your
problems will go away.
[BTW: the "temperature" (defined as the highest of the three temperatures
displayed by lm_sensors) on the nodes ranges between 38C at the bottom of
the racks to 60C at the top. The crashes that I have seen on my cluster
were not correlated with the location of a node within a rack, i.e., they
did not seem to have anything to do with temperature.]

Martin

========================================================================
Martin Siegert
Academic Computing Services                        phone: (604) 291-4691
Simon Fraser University                            fax:   (604) 291-4242
Burnaby, British Columbia                          email: siegert at sfu.ca
Canada  V5A 1S6
========================================================================
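
Editor's sketch, not part of the original message: the "temperature" used above is the highest of the readings that lm_sensors reports for a node. A minimal way to compute it from captured `sensors` text might look like the following. The sample text, the labels `temp1` to `temp3`, and the function name `highest_temperature` are all assumptions for illustration; real `sensors` output differs by chip and driver, so the pattern may need adjusting.

```python
import re

# Hypothetical sample resembling `sensors` output; labels and layout vary by chip.
SAMPLE = """\
temp1:       +38.0°C  (limit = +60.0°C)
temp2:       +55.0°C  (limit = +70.0°C)
temp3:       +47.5°C  (limit = +70.0°C)
fan1:        4500 RPM
"""

# Match "label: <number>C" at the start of a line, so that only the current
# reading is captured and the limit values in parentheses are ignored.
_READING = re.compile(r"^\s*[^:]+:\s*([+-]?\d+(?:\.\d+)?)\s*°?\s*C\b")


def highest_temperature(sensors_text):
    """Return the highest temperature reading in the text, or None if none found."""
    readings = []
    for line in sensors_text.splitlines():
        match = _READING.match(line)
        if match:
            readings.append(float(match.group(1)))
    return max(readings) if readings else None


if __name__ == "__main__":
    print(highest_temperature(SAMPLE))
```

Collecting this value per node together with the node's rack position is one way to check whether hangs track temperature, which is the comparison described in the note above.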
More information about the Beowulf mailing list
