Help on cluster hang problem...

Tue Jun 5 08:40:18 PDT 2001

I missed the original post of this, sorry, but this problem description
is just crying out "memory leak" to me.  A memory leak is the one way I
know of for a program to consistently crash a node from userspace
across kernel releases and hardware platforms.  I suppose there are
probably others, but offhand I cannot think of any.  The "ten minutes"
is especially revealing -- it cannot be just any old system bug -- it
has to be something that happens after a roughly uniform interval.
If it is a 2.4.x-specific bug (and doesn't happen with 2.2.x kernels),
it might instead be the buffering and caching problem discussed a few
days ago, which might be solvable by giving the nodes either at least
2x memory in swap or no swap at all.

I would try to debug this by cranking up a performance monitor of one
sort or another and watching all the components of memory (and other
system) utilization on the master node.  Of course you should also read
through /var/log/messages for the minutes leading up to the crash: if
the system e.g. runs out of memory (or even screws up buffers somehow),
there will usually be telltale messages from system daemons and so forth
that crash a minute or so before a critical component dies and the
system goes away.
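
If no performance monitor is handy, something as crude as the following
Python sketch would do for a first look.  It is only an illustration,
not a tested tool: the choice of fields, the log file name and the
five-second interval are all assumptions, and it relies on
/proc/meminfo providing "Name: value kB" lines.  It forces each sample
to disk so that the last few lines survive a hard hang.

```python
import os
import time

# Illustrative choice of fields; adjust to whatever /proc/meminfo offers.
FIELDS = ("MemFree", "Buffers", "Cached", "SwapFree")


def parse_meminfo(text):
    """Return {name: value} for the 'Name:  value [kB]' lines of meminfo."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Skip anything that is not 'Name: <int> [unit]', such as the
        # multi-column summary rows that older kernels print first.
        if not sep or len(parts) not in (1, 2) or not parts[0].isdigit():
            continue
        values[name.strip()] = int(parts[0])
    return values


def format_sample(values, now):
    """One log line: local time followed by the selected fields."""
    stamp = time.strftime("%H:%M:%S", time.localtime(now))
    cols = " ".join(f"{name}={values.get(name, -1)}" for name in FIELDS)
    return f"{stamp} {cols}"


def monitor(path="/proc/meminfo", log="meminfo.log", interval=5.0):
    """Append a sample to the log every `interval` seconds, forever."""
    with open(log, "a") as out:
        while True:
            with open(path) as src:
                values = parse_meminfo(src.read())
            out.write(format_sample(values, time.time()) + "\n")
            out.flush()
            os.fsync(out.fileno())  # push to disk before the node hangs
            time.sleep(interval)


if __name__ == "__main__":
    monitor()
```

Start it just before launching the job and read meminfo.log after the
reboot.  Free memory and swap both draining steadily toward zero over
the ten minutes would point at a leak; anything else suggests looking
elsewhere.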

   rgb

> > I've been using Linux for several years, but am new to Linux cluster computing.
> > Even though this code runs as a normal user (not root), it will hard-hang the
> > "master" node in about 10 minutes. "Hard-hang" means nothing on console, disk light on
> > solid, doesn't respond to reset or power switches- have to reset by pulling plug.
> >
> > I've tried the stock 2.4.2-2 kernel that loads with RedHat 7.1, I've tried the 2.4.2
> > kernel recompiled to specifically call the CPU an Athlon, and I've tried
> > downloading/using the 2.4.4 kernel.  All of my attempts produce the same result-
> > this program can crash the system every time it is run.
> >
> > I've searched the normal dejanews/altavista sites for Linux/Athlon/hang, but nothing
> > interesting pops out. I must be missing something simple- the 2.4.X kernels
> > can't be that unstable.
> >
> > Does this ring a bell with anyone in the group?
>
> Hi, doubt this is the reason, but could be worth checking out...
>
> /jon
>
>
> ----- Forwarded message from Theodore Tso <tytso at valinux.com> -----
>
> Envelope-to: bfinley at valinux.com
> Delivery-date: Mon, 21 May 2001 20:15:27 -0700
> From: Theodore Tso <tytso at valinux.com>
> To: tech at lists.valinux.com
> Subject: [VA-Tech] [tytso at MIT.EDU: Re: repeated forced fsck]
> Sender: tech-admin at lists.valinux.com
>
> Warning, it looks like there may be some cases where Red Hat 7.1's
> partitioning software may be producing corrupt partition tables.  This
> can cause filesystem corruption on the root partition if you're using
> LILO to boot your system.  Folks who are thinking about installing Red
> Hat 7.1 may want to check and make sure their partition table looks
> sane....
>
>                                                 - Ted
>
>
> --
>
> Envelope-to: tytso at localhost
> Delivery-date: Mon, 21 May 2001 23:05:05 -0400
> From: Theodore Tso <tytso at MIT.EDU>
> To: eichin-oa at boxedpenguin.com
> Cc: tytso at mit.edu
> Subject: Re: repeated forced fsck
>
> On Mon, May 21, 2001 at 08:43:50PM -0400, eichin-oa at boxedpenguin.com wrote:
> > This sounds a little odd, but didn't you mention having a problem on a
> > debian install with fsck happening on every boot?
>
> I don't remember seeing such a problem with a Debian install, but I
> have seen this problem before, and I may have mentioned it to you.
>
> The problem is that the Linux kernel uses the LBA value to determine
> where the partition starts, but LILO uses the CHS value to
> determine the partition start location (which it has to since it's
> using the BIOS functions) when it's writing out the first sector of
> the LILO map file, which it does on each boot because of a desire to
> make lilo -R a one-shot.
>
> So doing this will cause filesystem corruption (or at least some kind
> of corruption), since LILO will write out the map file to the wrong
> place.
>
> Whatever fdisk-like program Red Hat is using in 71, it's definitely
> really, really buggy.  I'm surprised they didn't catch this in their
> testing.
>
>                                                 - Ted
>
> >
> > ------- Start of forwarded message -------
> > To: Simon Josefsson <simon+openafs-info at josefsson.org>
> > Subject: Re: [OpenAFS] rh71, oafs 1.04: unloading unused kernel module crash machine
> > Message-ID: <990430724.3b08c60494374 at mail1.nada.kth.se>
> > From: tegner at nada.kth.se
> > Cc: openafs-info at openafs.org
> > References: <ilur8xl2vwg.fsf at barbar.josefsson.org>
> > MIME-Version: 1.0
> > Content-Type: text/plain; charset=iso-8859-1
> > Content-Transfer-Encoding: 8bit
> > Date: Mon, 21 May 2001 09:38:44 +0200 (MET DST)
> >
> > Probably not related, but have had disk problems with RH 7.1 (e.g. constantly
> > forced fsck on reboot). This has happened on two (independent) machines which
> > were upgraded from RH 6.2, and seems to be a result of incompatible LBA and CHS
> > values. Obtained the following from Partition Magic
> >
> > ``Partition Magic has detected an error on the partition starting at sector
> > 19390455 on disk 1. The starting LBA value is 19390455 and the CHS value is
> > 16450559. The LBA value and the CHS value must be equal. Partition Magic has
> > verified that the LBA value is correct and can fix the CHS value''.
> >
> > Have also experienced this on a fresh install of RH 7.1 on a machine with a 75
> > GB disk.
> >
> > /jon
>
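
For anyone who wants to check the LBA/CHS numbers in the Partition
Magic report quoted above, here is a small Python sketch of the usual
CHS-to-LBA arithmetic.  The geometry of 255 heads and 63 sectors per
track is an assumption (it is the common BIOS-translated geometry, not
something stated in the quoted messages), and the function name is made
up for illustration.

```python
def chs_to_lba(cylinder, head, sector,
               heads_per_cylinder=255, sectors_per_track=63):
    """Convert a CHS address to a linear sector number.

    Sectors are numbered from 1, cylinders and heads from 0.  The
    default geometry is an assumed 255 heads x 63 sectors per track.
    """
    if not 1 <= sector <= sectors_per_track:
        raise ValueError("sector out of range for this geometry")
    if not 0 <= head < heads_per_cylinder:
        raise ValueError("head out of range for this geometry")
    if cylinder < 0:
        raise ValueError("cylinder must be non-negative")
    return ((cylinder * heads_per_cylinder + head) * sectors_per_track
            + (sector - 1))


# A partition-table CHS field holds a 10-bit cylinder, so the largest
# address it can express under the assumed geometry is 1023/254/63.
largest = chs_to_lba(1023, 254, 63)
print(largest)  # 16450559
```

That figure matches the "CHS value" of 16450559 in the report, which
would be consistent with a CHS field pinned at its maximum because the
partition starts beyond what CHS can address.  That is only one reading
of the numbers, under the assumed geometry; it does not by itself say
where LILO ends up writing.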

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu