Help on cluster hang problem...

Cris Rhea crhea at mayo.edu
Sat May 26 22:23:42 PDT 2001


I've been using Linux for several years, but am new to Linux cluster computing.

I set up a "proof of concept" cluster with 4 nodes. Each node is a 1.2GHz Athlon
on a MicroStar K7TPro2-A motherboard with 1GB of RAM (RackSaver 1200).

RedHat 7.1 is loaded locally on each system. I also loaded mpich-1.2.0-10.i386.rpm
on each system and set up rhosts/hosts.equiv to allow rsh between the nodes.

Systems are interconnected with Intel 10/100 Ethernet cards.

One of the research PhDs in my group has a program that has run successfully on
other supercomputer-class systems (Cray and SGI). It is very CPU-intensive, but
does nothing fancy other than using MPI for communication (very little disk I/O,
etc.).

The /home file system is NFS-mounted on each system. I've tried using the master
node as the NFS server, and I've tried using another system outside the cluster.

Even though this code runs as a normal user (not root), it will hard-hang the
"master" node in about 10 minutes. "Hard-hang" means nothing on the console, disk light on
solid, and no response to the reset or power switches; I have to reset by pulling the plug.

I've tried the stock 2.4.2-2 kernel that loads with RedHat 7.1, I've tried the 2.4.2
kernel recompiled with the CPU type set specifically to Athlon, and I've tried
downloading and using the 2.4.4 kernel. All of my attempts produce the same result:
his program crashes the system every time it is run.

I've searched the normal dejanews/altavista sites for Linux/Athlon/hang, but nothing
interesting pops out. I must be missing something simple- the 2.4.X kernels
can't be that unstable.

Does this ring a bell with anyone in the group?

TIA-

-- Cris

---
 Cristopher J. Rhea                     Mayo Foundation
 Research Computing Facility             Pavilion 2-25
 crhea at Mayo.EDU                        Rochester, MN 55905
 Fax: (507) 266-4486                     (507) 284-0587










