More on cluster hang problem....
crhea at mayo.edu
Thu Jun 7 13:10:37 PDT 2001
First, let me thank the folks who offered suggestions on how to diagnose
this problem and suggestions for solutions...
Jon Tegner - Pointed me at a note on the VA linux tech list about RH7.1
messing up disk partitioning.
David van der Spoel - Pointer to a web site discussing flakey hardware
Tony Skjellum - Suggested using his company's commercial version of MPI.
Patrick Lesher - Suggested overheating and using a software package called
"sensors" to read MB temps.
Mark Hahn - There are known problems with the KT133A-based systems.
John LaBounty - How to force power off with an ATX power supply.
Robert G. Brown - In his experience, this points to a memory leak and/or
swap space issue.
Kevin Simpson - Script for monitoring for memory leak, etc.
Jacobs - Out of RAM issue.
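A minimal sketch of the kind of memory-leak monitoring mentioned above. This is not Kevin Simpson's script, which is not reproduced here. It assumes the "Name: value kB" lines found in /proc/meminfo; the function names and the leak heuristic are hypothetical.

```python
def parse_meminfo(text):
    """Return {field: kB} for the 'Name:   12345 kB' lines of /proc/meminfo."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only lines of the form "Name: <integer> kB"; skip everything else.
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def read_free_kb(path="/proc/meminfo"):
    """Read MemFree + SwapFree (in kB) from the given meminfo file."""
    with open(path) as handle:
        info = parse_meminfo(handle.read())
    return info.get("MemFree", 0) + info.get("SwapFree", 0)


def looks_like_leak(free_kb_samples, min_drop_kb):
    """Heuristic (an assumption): free memory never recovers between
    samples and drops by at least min_drop_kb from first to last."""
    if len(free_kb_samples) < 2:
        return False
    pairs = zip(free_kb_samples, free_kb_samples[1:])
    never_recovers = all(later <= earlier for earlier, later in pairs)
    total_drop = free_kb_samples[0] - free_kb_samples[-1]
    return never_recovers and total_drop >= min_drop_kb
```

Calling read_free_kb() once a minute on each node and appending the result to a file would leave a record that survives a hang, which a live display does not.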
Where we are now.......
After reading all the suggestions, nothing jumped out as pointing to our problem.
John LaBounty's comments on ATX supplies were very helpful, as it allowed
us to power cycle a stuck node without rebooting the next node (on a
RackSaver RS1200, there are 2 systems in the same 1U box with a single
Some more data points and ideas-
1. Went to 2.4.4 kernel on all 4 nodes. No change in the behavior.
2. Built similar mini-cluster on 2 Dell 2450's that arrived for a
different project (1GHz PIII's). One system is a single CPU, the
other has two CPUs. Application runs perfectly on the Dells and
completes reliably (set in a loop to re-run after it finishes; so far
it has completed eight 15-hour runs without a problem).
3. Issue with RAM and swap- swap was config'ed as 2X RAM (1GB physical RAM
in each system). Application does NOT leak memory (as measured by
xosview [cool little tool!]).
4. Nodes are named "rsnode1" ... "rsnode4". If we run only on nodes 3 and 4,
things run fine, with no crash (again, no memory leak over a ~15 hour run).
If I run on rsnode2, rsnode3 and rsnode4, it will crash rsnode2
after an hour or so. If I run on rsnode1 and rsnode2, it will crash
rsnode1 after ~10 mins.
5. No messages at all in /var/log/messages around the crash time.
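The node-subset experiments in item 4 amount to a simple elimination: any node that appears in a crashing run but never in a clean run is a suspect. A small sketch of that reasoning, using only the three runs described above (the function name is hypothetical):

```python
def suspect_nodes(runs):
    """runs: list of (nodes, crashed) pairs.
    Return the nodes that appear in crashing runs but in no clean run."""
    in_failing = set()
    in_passing = set()
    for nodes, crashed in runs:
        (in_failing if crashed else in_passing).update(nodes)
    return sorted(in_failing - in_passing)


# The three runs reported in item 4.
runs = [
    (("rsnode3", "rsnode4"), False),            # ran fine for ~15 hours
    (("rsnode2", "rsnode3", "rsnode4"), True),  # rsnode2 crashed after an hour or so
    (("rsnode1", "rsnode2"), True),             # rsnode1 crashed after ~10 mins
]

print(suspect_nodes(runs))  # ['rsnode1', 'rsnode2']
```

This only narrows the list; it says nothing about whether the cause is hardware or software on those nodes.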
I think I'm back to flakey hardware in rsnode1 and rsnode2. Any time I
involve rsnode1- things crash within 10-15 minutes. Any time I involve
rsnode2, things go longer, but still crash. With two out of four systems
involved in the issue, I had assumed it was a code/kernel issue rather than
just a simple hardware one.
Since these two nodes (rsnode1 and rsnode2) are physically in the same 1U
box, I suspect a batch of bad parts somewhere along the way.
I think my next experiment will be to configure MPI to use rsnode3 and
rsnode4 as well as the two Dell 2450's....
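A rough sketch of generating a host list for that experiment. Several things here are assumptions, not facts from this post: the "host" / "host:cpus" line format (an MPICH-style machines file; the MPI implementation in use is not named here), the Dell host names "dell-a" and "dell-b", and one CPU each for rsnode3 and rsnode4.

```python
def machinefile(hosts):
    """hosts: list of (hostname, cpus) pairs.
    Return one 'host' or 'host:cpus' line per entry (assumed format)."""
    lines = []
    for name, cpus in hosts:
        if cpus < 1:
            raise ValueError("cpus must be >= 1 for host " + name)
        lines.append(name if cpus == 1 else f"{name}:{cpus}")
    return "\n".join(lines) + "\n"


# Hypothetical host list: the two known-good RackSaver nodes plus the Dells.
hosts = [
    ("rsnode3", 1),  # CPU count assumed
    ("rsnode4", 1),  # CPU count assumed
    ("dell-a", 1),   # hypothetical name; the single-CPU 2450
    ("dell-b", 2),   # hypothetical name; the dual-CPU 2450
]

print(machinefile(hosts), end="")
```

Check the documentation of the MPI in use before relying on this format.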
Stay tuned to the soap opera....
Cristopher J. Rhea Mayo Foundation
Research Computing Facility Pavilion 2-25
crhea at Mayo.EDU Rochester, MN 55905
Fax: (507) 266-4486 (507) 284-0587