Help on cluster hang problem...

Jon Tegner Jon.Tegner at wiglaf.se
Sat May 26 23:15:21 PDT 2001


> I've been using Linux for several years, but am new to Linux cluster computing.
>
> I set up a "proof of concept cluster" with 4 nodes- each node is a 1.2GHz Athlon
> on a MicroStar K7TPro2-A motherboard with 1GB of RAM (RackSaver 1200).
>
> RedHat 7.1 is loaded locally on each system. I also loaded mpich-1.2.0-10.i386.rpm
> on each system and set up rhosts/hosts.equiv to allow all the rsh stuff...
>
> Systems are interconnected with Intel 10/100 Ethernet cards.
>
> One of the research PhDs in my group has a program that has run successfully on
> other supercomputer-class systems (Cray and SGI). Very CPU-intensive, but
> does nothing fancy other than using MPI for communication (very little disk I/O,
> etc.).
>
> The /home file system is NFS mounted on each system. I've tried with the NFS server
> on the master node and on another system outside the cluster.
>
> Even though this code runs as a normal user (not root), it will hard-hang the
> "master" node in about 10 minutes. "Hard-hang" means nothing on console, disk light on
> solid, doesn't respond to reset or power switches- have to reset by pulling plug.
>
> I've tried the stock 2.4.2-2 kernel that loads with RedHat 7.1, I've tried the 2.4.2
> kernel recompiled to specifically call the CPU an Athlon, and I've tried
> downloading/using the 2.4.4 kernel.  All of my attempts produce the same result-
> his program can crash the system every time it is run.
>
> I've searched the normal dejanews/altavista sites for Linux/Athlon/hang, but nothing
> interesting pops out. I must be missing something simple- the 2.4.X kernels
> can't be that unstable.
>
> Does this ring a bell with anyone in the group?

Hi, doubt this is the reason, but could be worth checking out...

/jon


----- Forwarded message from Theodore Tso <tytso at valinux.com> -----

Envelope-to: bfinley at valinux.com
Delivery-date: Mon, 21 May 2001 20:15:27 -0700
From: Theodore Tso <tytso at valinux.com>
To: tech at lists.valinux.com
Subject: [VA-Tech] [tytso at MIT.EDU: Re: repeated forced fsck]
Sender: tech-admin at lists.valinux.com

Warning, it looks like there may be some cases where Red Hat 7.1's
partitioning software may be producing corrupt partition tables.  This
can cause filesystem corruption on the root partition if you're using
LILO to boot your system.  Folks who are thinking about installing Red
Hat 7.1 may want to check and make sure their partition table looks
sane....

                                                - Ted


--

Envelope-to: tytso at localhost
Delivery-date: Mon, 21 May 2001 23:05:05 -0400
From: Theodore Tso <tytso at MIT.EDU>
To: eichin-oa at boxedpenguin.com
Cc: tytso at mit.edu
Subject: Re: repeated forced fsck

On Mon, May 21, 2001 at 08:43:50PM -0400, eichin-oa at boxedpenguin.com wrote:
> This sounds a little odd, but didn't you mention having a problem on a
> debian install with fsck happening on every boot?

I don't remember seeing such a problem with a Debian install, but I
have seen this problem before, and I may have mentioned it to you.

The problem is that the Linux kernel uses the LBA value to determine
where the partition starts, but LILO uses the CHS value (which it has
to, since it's using the BIOS functions) when it writes out the first
sector of the LILO map file.  It does that on each boot because of a
desire to make lilo -R a one-shot.

So when the two values disagree, this will cause filesystem corruption
(or at least some kind of corruption), since LILO will write out the
map file to the wrong place.
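
A minimal sketch of the kind of check this implies: decode one 16-byte
MBR partition entry and compare the start address stored as CHS with the
one stored as LBA. The 255-head, 63-sector geometry and both sample
entries are assumptions made up for illustration; they are not taken
from any of the machines discussed here.

```python
import struct

# Assumed BIOS-translated geometry; real disks may report other values.
HEADS = 255
SECTORS_PER_TRACK = 63


def decode_chs(raw):
    """Unpack the 3-byte CHS field of an MBR partition entry."""
    head = raw[0]
    sector = raw[1] & 0x3F                      # low 6 bits, 1-based
    cylinder = ((raw[1] & 0xC0) << 2) | raw[2]  # 10 bits in total
    return cylinder, head, sector


def chs_to_lba(cylinder, head, sector):
    """Convert a CHS triple to a sector number under the assumed geometry."""
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)


def start_addresses(entry):
    """Return (lba_start, chs_start_as_lba) for a 16-byte partition entry."""
    lba_start = struct.unpack_from("<I", entry, 8)[0]
    chs_start = chs_to_lba(*decode_chs(entry[1:4]))
    return lba_start, chs_start


# Hypothetical entry: bootable, CHS start (0, 1, 1), type 0x83, LBA start 63.
good = bytes([0x80, 1, 1, 0, 0x83, 254, 63, 99]) + struct.pack("<II", 63, 1606437)
# The same entry with the CHS start wrongly set to (0, 2, 1).
bad = bytes([0x80, 2, 1, 0, 0x83, 254, 63, 99]) + struct.pack("<II", 63, 1606437)

for name, entry in (("good", good), ("bad", bad)):
    lba, chs = start_addresses(entry)
    print(name, lba, chs, "consistent" if lba == chs else "MISMATCH")
```

A tool that trusts the LBA field and one that trusts the CHS field would
agree on where the "good" partition starts and disagree on the "bad" one.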

Whatever fdisk-like program Red Hat is using in 7.1, it's definitely
really, really buggy.  I'm surprised they didn't catch this in their
testing.

                                                - Ted

>
> ------- Start of forwarded message -------
> To: Simon Josefsson <simon+openafs-info at josefsson.org>
> Subject: Re: [OpenAFS] rh71, oafs 1.04: unloading unused kernel module crash machine
> Message-ID: <990430724.3b08c60494374 at mail1.nada.kth.se>
> From: tegner at nada.kth.se
> Cc: openafs-info at openafs.org
> References: <ilur8xl2vwg.fsf at barbar.josefsson.org>
> MIME-Version: 1.0
> Content-Type: text/plain; charset=iso-8859-1
> Content-Transfer-Encoding: 8bit
> Date: Mon, 21 May 2001 09:38:44 +0200 (MET DST)
>
> Probably not related, but have had disk problems with RH 7.1 (e.g. constantly
> forced fsck on reboot). This has happened on two (independent) machines which
> were upgraded from RH 6.2, and seems to be a result of incompatible LBA and CHS
> values. Obtained the following from Partition Magic
>
> ``Partition Magic has detected an error on the partition starting at sector
> 19390455 on disk 1. The starting LBA value is 19390455 and the CHS value is
> 16450559. The LBA value and the CHS value must be equal. Partition Magic has
> verified that the LBA value is correct and can fix the CHS value''.
>
> Have also experienced this on a fresh install of RH 7.1 on a machine with a 75
> GB disk.
>
> /jon
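
As a rough cross-check of the two numbers in the Partition Magic message,
the arithmetic below assumes the common 255-head, 63-sector translated
geometry (an assumption; the message does not state the geometry). Under
that assumption the reported CHS value is exactly the largest address a
CHS field can hold, and the reported LBA value falls on a cylinder
boundary past cylinder 1023, which CHS cannot express.

```python
# Assumed geometry; not stated in the Partition Magic message.
HEADS = 255
SECTORS_PER_TRACK = 63

REPORTED_LBA = 19390455  # from the Partition Magic message
REPORTED_CHS = 16450559  # from the Partition Magic message


def chs_to_lba(cylinder, head, sector):
    """Convert a CHS triple to a sector number under the assumed geometry."""
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)


# Largest triple a 10-bit cylinder / 8-bit head / 6-bit sector field can hold
# when the geometry has 255 heads: (1023, 254, 63).
max_chs_address = chs_to_lba(1023, 254, 63)

# Which cylinder does the LBA start fall in, and how far into it?
cylinder, offset = divmod(REPORTED_LBA, HEADS * SECTORS_PER_TRACK)

print(max_chs_address == REPORTED_CHS)  # True
print(cylinder, offset)                 # 1207 0
```

If the assumed geometry is right, the CHS field looks pinned at its
maximum rather than random, so any tool that relies on it would land well
short of where the partition actually starts.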
