jducom at nd.edu
Wed Jul 31 13:25:18 PDT 2002
The nodes of our cluster are:
Dell Workstation, dual Xeon 1.7GHz, 1GB RAM, Red Hat 7.2 running kernel 2.4.18
patched for IRQ balancing, SysKonnect SK9D21 Gigabit Ethernet
The cluster is heavily used for MPI programs using MPICH 1.2.4.
Each node mounts NFS directories w/ the following options:
ACPI is installed to overcome some APM issues w/ the poweroff command on
But some nodes occasionally hang for unknown reasons. They don't crash
though (they would reboot anyway: cat /proc/sys/kernel/panic -> 0 ).
There is no way to connect to them.
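
To double-check how a node is set up to react to a kernel panic, the value
printed by the cat command above can be decoded with a small helper. This is
only a sketch: the function names are made up for illustration, and it assumes
the standard Linux meaning of kernel.panic (0 = no automatic reboot, N > 0 =
reboot after N seconds, negative = reboot immediately on newer kernels).

```python
from pathlib import Path


def panic_behaviour(value: int) -> str:
    """Describe what the kernel does after a panic for a given kernel.panic value."""
    if value == 0:
        # 0 means the kernel loops forever after a panic; it does not reboot.
        return "no automatic reboot"
    if value < 0:
        # Negative values are only meaningful on newer kernels.
        return "reboot immediately"
    return f"reboot after {value} s"


def read_panic_timeout(path: str = "/proc/sys/kernel/panic") -> int:
    """Read the panic timeout from procfs (the path is a parameter so it can be tested)."""
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    timeout = read_panic_timeout()
    print(f"kernel.panic = {timeout}: {panic_behaviour(timeout)}")
```

Running the script on each node (for example through rsh or ssh in a loop)
would show whether all nodes share the same panic setting.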
I installed a serial console on some nodes (cf. my previous email about
remote serial console). When I connect thru the serial console to a hung
node, I can't even reboot the node BUT minicom shows that the machine is
It happens most of the time when MPI programs establish communications
What's going on? NFS hangs (but there is nothing in /var/log/messages or
the other logs)? ACPI problem? Does the console die? Switch issues?