Dual-Athlon Cluster Problems
siegert at sfu.ca
Thu Jan 23 10:52:05 PST 2003
On Thu, Jan 23, 2003 at 05:45:40PM +1100, Chris Steward wrote:
> We're in the process of setting up a new 32-node dual-athlon cluster running
> Redhat 7.3, kernel 2.4.18-19.7.smp. The configuration is attached below. We're
> having problems with nodes hanging during calculations, sometimes only after
> several hours of runtime. We have a serial console connected to such nodes but
> that is unable to interact with the nodes once they hang. Nothing is logged
> either. Running jobs on one CPU doesn't seem to present too much
> of a problem, but when the machines are fully loaded (both CPUs at 100%
> utilization) errors start to occur and machines die, often up to 8 nodes
> within 24 hours. Temperature of nodes under full load is approximately 55C.
> We have tried using the "noapic" option but the problems still persist. Using
> other software not requiring enfuzion 6 also produces the same problems.
I run a cluster of 96 dual AMD nodes (Tyan 2462 mb).
> We seek feedback on the following:
> 1/ Are there issues using redhat 7.3 as opposed to 7.2 in such
> a setup ?
> 2/ Are there known issues with 2.4.18 kernels and AMD chips ?
> We suspect the problems are kernel related.
ACPI does not work and you may need a newer version of i2c/lm_sensors.
Neither issue can account for the problems you are seeing (I haven't
checked RedHat's 2.4.18-19.7.smp kernel, but older versions were
compiled with ACPI disabled, as far as I can see for good reasons).
> 3/ Are there any problems with dual-athlon clusters using the
> MSI K7D Master L motherboard ?
> 4/ Are there any other outstanding issues with these machines
> under constant heavy load ?
In 99% of all the crashes I have seen on my cluster (and I have seen
a lot) the reason was bad memory. If you did not buy memory certified by
the company that sold you the motherboard, exchange it and your problems
will go away.
[BTW: the "temperature" (defined as the highest of the three temperatures
displayed by lm_sensors) on the nodes ranges between 38C at the bottom of
the racks to 60C at the top. The crashes that I have seen on my cluster
were not correlated with the location of a node within a rack, i.e., they
did not seem to have anything to do with temperature.]
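For what it's worth, the "temperature" as defined above can be pulled out of the sensors output with a few lines of Python. This is only a rough sketch: the function name node_temperature and the assumed output layout (one reading per line, with any limits in parentheses after it) are illustrative assumptions, and the real lm_sensors output differs between chips and versions.

```python
import re

# Matches readings such as "+55.0°C" or "+38 C". The exact layout of the
# sensors output varies by chip and lm_sensors version (assumption).
_TEMP_RE = re.compile(r"([+-]?\d+(?:\.\d+)?)\s*°?C\b")


def node_temperature(sensors_text):
    """Return the highest temperature reading in the text, or None if there is none."""
    readings = []
    for line in sensors_text.splitlines():
        # Drop any "(limit = ...)" annotation so that thresholds are not
        # mistaken for actual readings.
        value_part = line.split("(", 1)[0]
        match = _TEMP_RE.search(value_part)
        if match:
            readings.append(float(match.group(1)))
    return max(readings) if readings else None
```

Feeding it the captured sensors output from each node gives one number per node, which can then be compared against the node's position in the rack.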
Academic Computing Services phone: (604) 291-4691
Simon Fraser University fax: (604) 291-4242
Burnaby, British Columbia email: siegert at sfu.ca
Canada V5A 1S6