Mysterious kernel hangs

Felix Rauch rauch at inf.ethz.ch
Thu Mar 15 05:34:22 PST 2001


We recently bought a new 16-node cluster with dual 1 GHz Pentium III
nodes, but the machines mysteriously freeze :-(

The nodes have STL2 boards (version A28808-301), onboard Adaptec SCSI
controllers (7899P), onboard Intel Fast Ethernet adapters (82557
[Ethernet Pro 100]) and additional Packet Engines Hamachi GNIC-II
Gigabit Ethernet cards.

We tried kernels 2.2.x, 2.4.1 and now even 2.4.2-ac20, but the problem
seems to be the same with all of them: when we run experiments that
use the network intensively, any of the machines may simply freeze
after a few hours. The frozen machine does not respond to anything, and
so far we have not been able to see any log entries related to the
freeze on virtual console 10 :-(   We have now switched on all the
"Kernel Hacking" options in the kernel configuration (especially the
logging) and will try again; hopefully we will at least see some log
output.

The freezes also happen when we run non-network-intensive jobs on the
machines (e.g. SETI@home), but clearly less often.

Does anyone have any idea what could be going wrong, or what we could
try in order to find the cause of the problem?

Regards,
Felix
-- 
Felix Rauch                      | Email: rauch at inf.ethz.ch
Institute for Computer Systems   | Homepage: http://www.cs.inf.ethz.ch/~rauch/
ETH Zentrum / RZ H18             | Phone: ++41 1 632 7489
CH - 8092 Zuerich / Switzerland  | Fax:   ++41 1 632 1307




