su segfault (core dump) problem.

Robert G. Brown rgb at phy.duke.edu
Mon Jan 22 14:18:41 PST 2001


On Mon, 22 Jan 2001, Georgia Southern Beowulf Cluster Project wrote:

> Hello,
>
> I'm running 15 diskless nodes attached to another node using NFS to export
> to each node: a / filesystem, a shared /home filesystem, and a shared /usr
> filesystem.  This is a new setup I just implemented (mostly sharing /usr as
> before each node had its own instead of sharing).  I'm running RH 6.2 with
> PVM3 and I've been testing it with pvmpovray.  As of right now, I can rsh
> into all nodes from any other node (security is not an issue).  However,
> when I su to become root and do administrative tasks all nodes will now
> segfault and sometimes (not always) cause a core dump.  Now, I'm far from
> knowledgeable about cores, but does anyone have any suggestions or previous
> experiences with similar problems?  This did not happen before, when all
> nodes had individual /usr directories (this made software addition
> horrible).

Are there any key files opened by programs you are doing maintenance on
that are inadvertently shared by and used by all the different nodes?
Unfortunately there are plenty of non-FHS-compliant programs out there --
anything that you are running that is (for example) opening a file in
/usr (presumed shared and static) instead of /var (presumed local and
volatile) is a candidate for such a problem if node a reads from and
writes to the file while node b is doing the same thing.  A shared /dev
can be equally evil in the same way.  Don't assume that even RH binaries
are all "clean" in this regard -- remember that they package source code
from literally hundreds of folks and there is no guarantee that it is
all FHS compliant or even sane.  Plenty of programs assume that you can
write to anything as if it is local because on the developer's system it
is!
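
A quick way to hunt for offenders (a sketch only -- the paths and the
one-day window are arbitrary) is to look for recent writes under the
shared tree on the server, or to catch writers in the act with lsof:

   # on the NFS server: anything under the exported /usr modified in
   # the last day is being written to by some node -- a corruption
   # candidate on a tree that is supposed to be static
   find /usr -type f -mtime -1 -ls

   # processes currently holding files open for writing under /usr
   # (numeric FD entries ending in 'w' or 'u' indicate write access)
   lsof +D /usr | awk 'NR==1 || $4 ~ /^[0-9]+[wu]/'

Mounting the shared /usr read-only on the nodes is the proactive
version of the same check -- a rogue writer then fails loudly instead
of quietly stepping on its neighbors.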

Second point is that you might want to check out e.g.:

   ftp://ftp.yellowdoglinux.com/pub/yellowdog/software/yup/

(from a recent freshmeat posting).  yup promises to be the long-awaited
automagical package maintainer for RPMs.  It is easy enough to install
RH via kickstart so that all nodes are identical -- just use a common
kickstart file and use dhcpd to distribute node identities on the basis
of MAC address.  It is harder to keep them that way.  yup allows you to
automagically synchronize RPM-based host descriptions and handles that
eternally annoying dependency tree problem for you.  It goes and finds
new packages AND all their dependencies and updates everything
necessary.  I've only seen it demonstrated a few times (and haven't
installed it myself yet) but the demos were awesome.  It makes managing
updates of installed packages absolutely trivial and automatic.
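
For concreteness, the dhcpd side of that trick looks something like
the following (a sketch -- the hostname, MAC address, and IP here are
made up):

   # /etc/dhcpd.conf fragment: pin each node's identity to the MAC
   # address of its NIC, so one common kickstart file still yields
   # distinct, stable host identities
   host node01 {
       hardware ethernet 00:A0:C9:12:34:56;   # node01's ethernet card
       fixed-address 192.168.1.101;
       option host-name "node01";
   }

One such stanza per node, and nodes can be reinstalled or rebooted in
any order without their identities drifting.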

In other words, there are new tools on the horizon that will radically
improve one's ability to scalably manage RPM-based clusters of all
sorts, from beowulfs up to simple departmental clusters of workstations.

The last remark is that you might want to look over the core dumps
themselves to see what binary is producing them.  In elder days I could
do this with adb.  gdb doesn't seem to have a command that tells you
the name of the crashed program (or if it does I don't know it and
haven't been able to figure it out -- please feel free to Enlighten me,
anybody who knows).  However "strings core | less" will usually work
well enough -- it certainly reveals to me that most of my personal local
core files come from netscape (sigh).
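
For what it's worth, the whole toolbox looks something like this
(file(1) on a Linux core usually names the originating binary; I
believe gdb will also print "Core was generated by ..." when handed
just the core file, but treat that as unverified):

   strings core | less    # scan for program and library names
   file core              # usually reports "... from 'progname'"
   gdb -c core            # may announce the generating program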

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email: rgb at phy.duke.edu
