More on cluster hang problem....

Cris Rhea crhea at mayo.edu
Thu Jun 7 13:10:37 PDT 2001


First, let me thank the folks who offered suggestions on how to diagnose
this problem and ideas for possible solutions...

---------------------------------

Jon Tegner - Pointed me at a note on the VA Linux tech list about RH7.1
	     messing up disk partitioning.

David van der Spoel - Pointer to a web site discussing flaky hardware
		      (esp. memory).

Tony Skjellum - Suggested using his company's commercial version of MPI.

Patrick Lesher - Suggested overheating and using a software package called
		 "sensors" to read MB temps.

Mark Hahn - There are known problems with the KT133A-based systems.

John LaBounty - How to force power off with an ATX power supply.

Robert G. Brown - In his experience, this points to a memory leak and/or
		  swap space issue.

Kevin Simpson - Script for monitoring for memory leak, etc.

Jacobs - Suggested an out-of-RAM issue.

---------------------------------

Where we are now.......

After reading all the suggestions, nothing jumped out as clearly pointing to
our problem. John LaBounty's comments on ATX supplies were very helpful, as
they allowed us to power-cycle a stuck node without rebooting the other node
in the box (on a RackSaver RS1200, there are 2 systems in the same 1U chassis
sharing a single power cord).

Some more data points and ideas-

1. Went to 2.4.4 kernel on all 4 nodes. No change in the behavior.

2. Built a similar mini-cluster on 2 Dell 2450s that arrived for a
   different project (1 GHz PIIIs). One system has a single CPU, the
   other has two CPUs. The application runs perfectly on the Dells and
   will run to completion reliably (set in a loop to re-run after it
   finishes- so far it has completed 8 15-hour runs without a problem).
   A rough sketch of that re-run loop appears after this list.

3. Issue with RAM and swap- swap was configured as 2X RAM (1 GB physical RAM
   in each system). The application does NOT leak memory (as measured by
   xosview [cool little tool!]). A sketch of a simple unattended memory/swap
   logger also follows the list.

4. Nodes are named "rsnode1" ... "rsnode4". If we run only on rsnode3 and
   rsnode4, things run fine (again, no memory leak over a ~15-hour run) and
   the job completes reliably on those 2 nodes without crashing.

   If I run on rsnode2, rsnode3 and rsnode4- it will crash rsnode2
   after an hour or so. If I run on rsnode1 and rsnode2- it will crash
   rsnode1 after ~10 mins.

5. No messages at all in /var/log/messages around the crash time. 
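
(For anyone curious, the re-run loop mentioned in point 2 is nothing fancy.
A minimal sketch in Python- the "./run_case.sh" wrapper and the log file
names here are placeholders, not our actual job script:

    #!/usr/bin/env python
    # Re-run the job forever, recording each run's exit status and duration
    # so an overnight soak test leaves a paper trail.
    # "./run_case.sh" is a placeholder for whatever launches the real job.
    import os, time

    run = 0
    while 1:
        run = run + 1
        start = time.time()
        status = os.system("./run_case.sh > run%d.out 2>&1" % run)
        hours = (time.time() - start) / 3600.0
        log = open("rerun.log", "a")
        log.write("run %d: exit status %d after %.1f hours\n"
                  % (run, status, hours))
        log.close()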
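
Along the same lines as Kevin Simpson's monitoring-script suggestion, the
xosview check in point 3 can also be captured unattended over a 15-hour run.
Another rough sketch, assuming the Linux /proc/meminfo fields- the one-minute
interval and log file name are arbitrary:

    #!/usr/bin/env python
    # Log free memory and free swap from /proc/meminfo once a minute.
    # A real leak would show up as a steady downward trend in the log.
    import time

    def meminfo():
        vals = {}
        for line in open("/proc/meminfo").readlines():
            parts = line.split()
            if len(parts) >= 2 and parts[0][-1:] == ":":
                vals[parts[0][:-1]] = parts[1]    # value is in kB
        return vals

    while 1:
        m = meminfo()
        out = open("memlog.txt", "a")
        out.write("%s MemFree=%s kB SwapFree=%s kB\n"
                  % (time.ctime(), m.get("MemFree", "?"),
                     m.get("SwapFree", "?")))
        out.close()
        time.sleep(60)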


I think I'm back to flaky hardware in rsnode1 and rsnode2. Any time I
involve rsnode1- things crash within 10-15 minutes. Any time I involve
rsnode2, things go longer, but still crash. With two out of four systems
involved in the issue, I had assumed it was a code/kernel issue rather than
just a simple hardware one.

Since these two nodes (rsnode1 and rsnode2) are physically in the same 1U
box, I suspect a batch of bad parts somewhere along the way. 

I think my next experiment will be to configure MPI to use rsnode3 and
rsnode4 as well as the two Dell 2450s....
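
Wiring that up should mostly be a matter of listing the trusted hosts in a
machines file and handing it to mpirun. A small sketch of the idea, assuming
an MPICH-style mpirun that takes -np and -machinefile (the Dell hostnames
and "./app" below are placeholders, not our real names):

    #!/usr/bin/env python
    # Build a machines file naming only the nodes we currently trust,
    # then launch the job across them with mpirun.
    import os

    hosts = ["rsnode3", "rsnode4", "dell1", "dell2"]   # Dell names made up
    f = open("machines", "w")
    for h in hosts:
        f.write(h + "\n")
    f.close()
    os.system("mpirun -np %d -machinefile machines ./app" % len(hosts))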


Stay tuned to the soap opera....


--- Cris

---
 Cristopher J. Rhea                     Mayo Foundation
 Research Computing Facility             Pavilion 2-25
 crhea at Mayo.EDU                        Rochester, MN 55905
 Fax: (507) 266-4486                     (507) 284-0587












