BEOWULF cluster hangs

Robert G. Brown rgb at phy.duke.edu
Thu Sep 26 16:00:16 PDT 2002


On Thu, 26 Sep 2002, Michael Prinkey wrote:

> There are many problems with the virtual memory manager in the early
> 2.4 series of kernels.  These have mostly been fixed in later 2.4
> releases.  I recommend trying 2.4.19 to see if this fixes the problem.

To second this: if you just upgrade to RH 7.3 off a current image
(including the updates) you will both get a much more current kernel and
ensure that certain pernicious bugs or security holes elsewhere in your
network (assuming your desktops are also 7.2) are closed.
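
A quick way to see which nodes still run an older kernel is to compare
each node's release string against the version suggested above.  The
following is only a minimal sketch: the 2.4.19 threshold comes from the
quoted message, while the helper names (kernel_tuple, is_current,
check_local) are made up for illustration.

```python
import platform
import re

# Threshold taken from the quoted recommendation; adjust as needed.
MIN_KERNEL = (2, 4, 19)


def kernel_tuple(release):
    """Turn a release string such as '2.4.7-10' into (2, 4, 7)."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    # A missing third component is treated as 0.
    return tuple(int(part) if part else 0 for part in m.groups())


def is_current(release, minimum=MIN_KERNEL):
    """True if the release is at least the minimum version."""
    return kernel_tuple(release) >= minimum


def check_local():
    """Apply the check to the machine this script runs on."""
    return is_current(platform.release())
```

With these definitions, is_current("2.4.7-10") is False and
is_current("2.4.19") is True; running check_local() on every node (by
whatever remote shell you already use) gives a rough list of machines
that still need the upgrade.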

...he says with a grimace after being "slapped" on a machine where the
nightly cron update that was supposed to have automatically updated
apache and mod_ssl failed, and a second problem kept the failure from
generating an error report by mail.  Sigh.

This (to relate the point to recent ongoing discussion) is one reason I
prefer a cluster to be based on a current and "standard" distribution --
updates for performance and/or security reasons are key, and it isn't
convenient or cost effective for local admins to have to build/patch
every one themselves locally.  Note that something like Scyld is OK if
you hide it behind a firewall (in part because the Scyld people work
very hard on fixing things to the extent that it becomes a properly
supported "distribution" in and of itself).  DIY-ers should probably try
very hard to work with a current distribution whenever possible, and to
install an automated update mechanism as well.

Of course, it helps even more if you don't totally trust the update and
check by hand from time to time (he says, still smarting from the
slap:-).

   rgb

> 
> Mike Prinkey
> Aeolus Research, Inc.
> 
> 
> G.de-With wrote:
> 
> > Hello
> >
> > For a month now we have had a Linux Beowulf cluster. The cluster
> > contains 7 dual-processor 2 GHz P4 computers, with 8 GB of RAM per
> > machine. For our network we use Gigabit Ethernet.
> >
> > The problem we have with our cluster is as follows.
> > When we run large computational fluid simulations, the simulation
> > starts to slow down. At some point the response of the computer is so
> > poor that we have to kill the simulation. In the worst case, when a
> > simulation was running overnight, the computer hung and had to be
> > reset.
> > Even when we manage to kill the simulation in time, the computer
> > remains very slow and a reboot is necessary to regain full
> > performance.
> >
> > My first suspicion was that the computer had simply started swapping,
> > but no swap space is being used. Instead, some RAM is still free
> > (between 5 and 10%) and only about 90% of the RAM is in use, of which
> > 50% is cached and the other 50% is in active use. In addition, the
> > CPU usage is very low.
> >
> > It may be useful to mention that this problem occurs with both
> > sequential and parallel simulations.
> >  
> >
> > On our cluster we are running RH7.2 with Linux kernel version
> > 2.4.7-10. We set up our cluster using oscar-1.2.1rh72. The /home
> > partition on the world client is shared over the network using NFS.
> >
> > /etc/fstab
> >
> > 192.168.1.100:/home /home nfs rw 0 2
> >  
> >  
> >
> > 1) If anyone can suggest why our computers are slowing down or
> > hanging, or has had a similar experience, please let me know.
> > 2) To my understanding, the most important indicators of computer
> > usage are:
> > - memory usage
> > - cpu usage
> > Are there other key components or indicators that could point to a
> > reduction in computer performance, and if so, how can I see their
> > status?
> >
> > Govert
> >  
> >
> >-- 
> > ------------------------------------------------------------
> >| Dr. Govert de With     Research Fellow                     |
> >| Fluid Mechanics Research Group                             |
> >| University of Hertfordshire                                |
> >| Tel: 01707 284124 Fax: 01707 285086                        |
> > ------------------------------------------------------------
> >| The horizon of many people is a circle of radius zero,     |
> >| and they call that their point of view.                    |
> > ------------------------------------------------------------
> >
> >  
> 
> 
> 
> 
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> 
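
On the quoted description of memory use (some RAM free, about half
cached, no swap in use): those figures can be pulled out of
/proc/meminfo and logged over the course of a run, which makes it
easier to see what changes as the node slows down.  This is a rough
sketch only.  It assumes the "Name:  N kB" line format and skips any
line that does not match it; the function names are invented for
illustration.

```python
import re

# Matches lines such as "MemFree:   400000 kB"; anything else is ignored.
_LINE = re.compile(r"^([^:\s]+):\s+(\d+)\s+kB\s*$")


def parse_meminfo(text):
    """Return {field: kilobytes} for every 'Name:  N kB' line in text."""
    fields = {}
    for line in text.splitlines():
        m = _LINE.match(line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields


def memory_summary(fields):
    """Percent of RAM free and cached, plus swap in use (kB)."""
    total = fields["MemTotal"]
    swap_used = fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)
    return {
        "free_pct": 100.0 * fields.get("MemFree", 0) / total,
        "cached_pct": 100.0 * fields.get("Cached", 0) / total,
        "swap_used_kb": swap_used,
    }
```

On a node, something like
memory_summary(parse_meminfo(open("/proc/meminfo").read())) called once
a minute and appended to a log would show whether the free and cached
fractions drift before the slowdown sets in.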

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu





