BEOWULF cluster hangs

G.de-With G.de-With at herts.ac.uk
Thu Sep 26 07:16:52 PDT 2002


Hello

For the past month we have been running a Linux Beowulf cluster. The
cluster contains 7 dual-processor 2 GHz P4 machines with 8 GB of RAM each,
connected by Gigabit Ethernet.

The problem we have with our cluster is as follows.
When we run large computational fluid dynamics simulations, the simulation
gradually slows down. Eventually the machine becomes so unresponsive that
we have to kill the simulation. In the worst case, with a simulation left
running overnight, the machine hung and had to be reset.
Moreover, even when we manage to kill the simulation in time, the machine
remains very slow and only a reboot restores its full performance.

My first suspicion was that the machine had simply started swapping, but
no swap space is being used. Instead, some RAM is still free (between 5 and
10%); of the roughly 90% that is in use, about half is cache and the other
half is in use by processes. CPU usage is also very low.
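To make figures like these easy to compare across nodes, here is a minimal sketch that splits RAM into free memory, buffer/page cache, and memory held by processes. It assumes the Linux /proc/meminfo "Key:  value kB" lines (MemTotal, MemFree, Buffers, Cached, SwapTotal, SwapFree); the function names are hypothetical. Buffers and Cached hold file data the kernel can normally reclaim, so they are reported separately from process memory.

```python
"""Split RAM usage into free, buffer/page cache and process memory.

Hypothetical helper for illustration; assumes the Linux /proc/meminfo
"Key:  value kB" lines (MemTotal, MemFree, Buffers, Cached, SwapTotal, SwapFree).
"""
import os


def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo keys to integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        # Skip header lines and anything without a leading numeric value.
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_breakdown(info):
    """Return percentages of total RAM that are free, cache, and process memory."""
    total = info["MemTotal"]
    free = info["MemFree"]
    # Buffers and Cached hold file data the kernel can normally reclaim.
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    processes = total - free - cache
    return {
        "free_pct": 100.0 * free / total,
        "cache_pct": 100.0 * cache / total,
        "process_pct": 100.0 * processes / total,
        "swap_used_kb": info.get("SwapTotal", 0) - info.get("SwapFree", 0),
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        for name, value in memory_breakdown(parse_meminfo(f.read())).items():
            print(f"{name:>13}: {value:.1f}")
```

Running it on each node during a slow-down (for example from cron) would show whether process memory or cache is growing over time.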

It may also be worth mentioning that the problem occurs with both
sequential and parallel simulations.


The cluster runs Red Hat 7.2 with Linux kernel 2.4.7-10 and was set up
using oscar-1.2.1rh72. The /home partition is shared with the client nodes
over the network using NFS.

/etc/fstab:

192.168.1.100:/home /home nfs rw 0 2



1) If anyone has suggestions as to why our machines are slowing down or
hanging, or has had a similar experience, please let me know.
2) As I understand it, the most important indicators of machine load are:
- memory usage
- CPU usage
Are there other key components or indicators that could cause a loss of
performance, and if so, how can I monitor their status?
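Beyond memory and CPU percentages, two other indicators are the load average and the state of each process. On Linux the load average counts tasks that are runnable (state R) or in uninterruptible sleep (state D), so a high load combined with low CPU usage often points to processes blocked on I/O (disk or NFS) rather than computing. A minimal sketch, assuming the standard /proc/loadavg and /proc/<pid>/stat formats; the helper names are hypothetical:

```python
"""Report load average and process states from /proc (Linux).

Hypothetical helper for illustration. The Linux load average counts tasks
that are runnable (R) or in uninterruptible sleep (D); many D-state tasks
with low CPU usage often means processes are blocked on I/O.
"""
import glob
import os
from collections import Counter


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def process_state(stat_text):
    """Return the one-letter state field of a /proc/<pid>/stat line.

    The command name is in parentheses and may contain spaces or ')',
    so locate the state relative to the last ')'.
    """
    return stat_text[stat_text.rindex(")") + 2]


def count_states(stat_texts):
    """Count processes per state letter (R, S, D, Z, ...)."""
    return Counter(process_state(s) for s in stat_texts)


def read_proc_stats():
    """Read every /proc/<pid>/stat, skipping processes that exit meanwhile."""
    texts = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as f:
                texts.append(f.read())
        except OSError:
            continue
    return texts


if __name__ == "__main__" and os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as f:
        print("load average (1, 5, 15 min):", parse_loadavg(f.read()))
    states = count_states(read_proc_stats())
    print("process states:", dict(states))
    print("uninterruptible (D):", states.get("D", 0))
```

If the simulation and several system processes sit in state D while the CPUs are idle, the bottleneck is more likely I/O than memory or computation.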

Govert


--
 ------------------------------------------------------------
| Dr. Govert de With     Research Fellow                     |
| Fluid Mechanics Research Group                             |
| University of Hertfordshire                                |
| Tel: 01707 284124 Fax: 01707 285086                        |
 ------------------------------------------------------------
| Many people's horizon is a circle of radius zero,          |
| and they call it their point of view.                      |
 ------------------------------------------------------------

