bproc+autofs: oz_pgrp problem

hanzl at noel.feld.cvut.cz hanzl at noel.feld.cvut.cz
Thu May 9 03:22:15 PDT 2002


Hi,

I investigated the problems with autofs on a bproc node and I believe
there is a deadlock caused by an interaction between the way bproc
handles process groups and the way autofs uses them.

Here is a minimal way to reproduce the problem:

  modprobe --node 1 autofs
  bpsh 1 mkdir -p /usr/lib/autofs
  for x in /usr/lib/autofs/*; do bpcp "$x" 1:"$x"; done
  bpsh 1 touch /etc/yyy
  bpsh 1 automount /xxx file /etc/yyy
  bpsh 1 ps|grep 'automount'

  ...  22113 ?        00:00:00 automount

  bpsh 1 strace -p 22113

And in another window:

  bpsh 1 ls /xxx/zzz

If you also have this problem, ls hangs, automount can only be killed
with kill -9, and strace prints:

  read(4, "\3\0\0\0\0\0\0\0\2\0\0\0\3\0\0\0zzz\0\0\0\0\0\0\0\0\0\0"..., 272) = 272
  chdir("/xxx")                           = 0
  lstat64("zzz", 
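To keep a test window from being lost to the hang, the probe can be wrapped in a time limit. This is only a minimal sketch: probe_with_timeout is a hypothetical helper name, and it assumes timeout(1) from GNU coreutils is available on the head node. I have not checked whether killing the local bpsh also removes the stuck ls on the node.

```shell
#!/bin/bash
# probe_with_timeout LIMIT COMMAND [ARGS...]
# Hypothetical helper: run COMMAND for at most LIMIT seconds and report
# whether it finished or had to be killed.
probe_with_timeout() {
  local limit=$1
  shift
  # SIGKILL is used because the hung processes described here only react
  # to kill -9. GNU timeout exits with 124 on a normal timeout and with
  # 137 (128+9) when the command had to be killed with SIGKILL.
  timeout -s KILL "$limit" "$@" >/dev/null 2>&1
  local status=$?
  if [ "$status" -eq 124 ] || [ "$status" -eq 137 ]; then
    echo "hang"
  else
    echo "finished (exit status $status)"
  fi
}
```

For the case above this would be called as: probe_with_timeout 10 bpsh 1 ls /xxx/zzz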

With a lot of guesswork (as I am no expert in this) I deduced this
scenario:

1. The automount process finds out its own process group using getpgrp().

2. It mounts itself, using the mount executable with the pgrp option set
   to the result of getpgrp().

3. The kernel performs the mount and stores pgrp as oz_pgrp: processes
   with this 'magic' pgrp can see the raw directories instead of the
   automounted ones.

4. When an automounted subdirectory is accessed, the kernel writes a
   request to the pipe to automount, asking it to mount what should be
   seen there.

5. automount tests whether the automounted subdirectory exists; as it
   belongs to oz_pgrp, it should see the raw directories.

6. The kernel fails to recognize that automount belongs to oz_pgrp and,
   instead of showing the raw directories, it tries to ask automount
   itself to deliver what should be seen. Deadlock.

I believe that the problem is that getpgrp() on a bproc node returns
something other than the current->pgrp value tested inside the kernel
on the bproc node in autofs_oz_mode().
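One way to test this hypothesis might be to compare the pgrp value that automount hands to mount with the process group shown in /proc/<pid>/stat on the node. This is a sketch under a big assumption: that /proc on a bproc node shows the node-local value the kernel compares against, and not the masqueraded one. I have not verified that. All three helper names are hypothetical.

```shell
#!/bin/bash
# Print the process group from one line of /proc/<pid>/stat.
# The command name in parentheses may contain spaces, so everything up to
# the last ") " is stripped first; the remaining fields start with
# state, ppid, pgrp.
stat_pgrp() {
  local rest=${1##*') '}
  set -- $rest
  echo "$3"
}

# Print the value of pgrp= from a comma-separated mount option string
# like the one automount builds ("fd=...,pgrp=...,minproto=...").
opt_pgrp() {
  local opt
  local IFS=,
  for opt in $1; do
    case $opt in
      pgrp=*) echo "${opt#pgrp=}" ;;
    esac
  done
}

# Succeed when both views agree on the process group.
same_pgrp() {
  [ "$(stat_pgrp "$1")" = "$(opt_pgrp "$2")" ]
}
```

The stat line could come from something like bpsh 1 cat /proc/22113/stat, and the option string from the mount command line that automount executes (visible with strace -f). If same_pgrp fails for the running automount, that would support the explanation above.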

In more detail, the scenario looks like this:

1. in autofs-4.0.0pre10/daemon/automount.c:

   /* Make our own process group for "magic" reason: processes that share
      our pgrp see the raw filesystem behind the magic.  So if we are a
      submount, don't change -- otherwise we won't be able to actually
      perform the mount.  A pgrp is also useful for controlling all the
      child processes we generate. */
   if ( !submount && setpgrp() ) {
     syslog(LOG_CRIT, "setpgrp: %m");
     exit(1);
   }
   my_pgrp = getpgrp();

 (I am not sure whether setpgrp() is called or not but I think it
 might not matter now. It may however cause more problems beyond the
 one described here.)

2. still in autofs-4.0.0pre10/daemon/automount.c, in mount_autofs(...):

    sprintf(options, "fd=%d,pgrp=%u,minproto=2,maxproto=%d", pipefd[1],
  	    (unsigned)my_pgrp, AUTOFS_MAX_PROTO_VERSION);
    sprintf(our_name, "automount(pid%u)", (unsigned)my_pid);
    
    if (spawnl(LOG_DEBUG, PATH_MOUNT, PATH_MOUNT, "-t", "autofs", "-o",
  	       options, our_name, path, NULL) != 0) {
      syslog(LOG_CRIT, "cannot find autofs in kernel");

3. in kernel in linux/fs/autofs/inode.c, autofs_read_super():

    if ( parse_options(data,&pipefd,&root_inode->i_uid,&root_inode->i_gid,
                       &sbi->oz_pgrp,&minproto,&maxproto) ) {
        printk("autofs: called with bogus options\n");
        goto fail_dput;
    }

  (the "magic" group ends up in oz_pgrp)

4. in kernel in system call invoked by ls:

  The kernel writes a request to the pipe to the automount process,
  asking it to arrange things in /xxx/zzz. The system call does not
  return to ls until automount has done its work (and therefore never
  returns).

5. in autofs-4.0.0pre10/daemon/automount.c:

   static int handle_packet_missing(...)
     ...
     chdir(ap.path);
     if ( lstat(pkt->name,&st) == -1 ||
  	  (S_ISDIR(st.st_mode) && st.st_dev == ap.dev) ) {
       /* Need to mount or symlink */

  (lstat() should see the raw directories, but it does not and hangs)

6. in kernel in fs/autofs/autofs_i.h:

    /* autofs_oz_mode(): do we see the man behind the curtain?  (The
       processes which do manipulations for us in user space sees the raw
       filesystem without "magic".) */
    
    static inline int autofs_oz_mode(struct autofs_sb_info *sbi) {
      return sbi->catatonic || current->pgrp == sbi->oz_pgrp;
    }

  (automount should be recognized as "magic" with autofs_oz_mode()==1
   but instead gets the same treatment as ls in step 4 above. There
   is probably a kernel lock held around the autofs operations, or maybe
   the kernel even sends a request to automount via the pipe; in either
   case it has to deadlock, as automount is still waiting in lstat().)
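The decision in step 6 can be played through in user space with a toy model. This is only an illustration of the comparison quoted above, not kernel code; the function names are hypothetical, 22113 is the PID seen in the ps output earlier, and 301 is a made-up number standing in for whatever node-local value the kernel might hold.

```shell
#!/bin/bash
# oz_mode CATATONIC CURRENT_PGRP OZ_PGRP
# Mirrors: sbi->catatonic || current->pgrp == sbi->oz_pgrp
# Exit status 0 means "this process sees the raw filesystem".
oz_mode() {
  [ "$1" -ne 0 ] || [ "$2" -eq "$3" ]
}

# lookup CURRENT_PGRP OZ_PGRP
# Say what a lookup under the autofs mount point would do for a process
# in the given group while the mount is not catatonic.
lookup() {
  if oz_mode 0 "$1" "$2"; then
    echo "raw directory"
  else
    echo "ask automount and wait"
  fi
}

lookup 22113 22113   # both views agree: automount sees the raw directory
lookup 301 22113     # views differ: automount ends up waiting for itself
```

The second call is the deadlock case: the process doing the lookup is automount itself, so the request it would have to answer is never read from the pipe.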

*******************************

I may be wrong with my analysis, I am no expert on any of the things
involved (bproc, autofs). Please correct me if I am wrong.

If I am right, there are several possible ways to avoid the deadlock:

- Make a modified autofs.o which is aware of the bproc-related process
  group tricks: autofs_oz_mode() should test against the same value that
  is returned by getpgrp() (hopefully this can be done without
  node-head-node communication). We should also verify whether the
  setpgrp() used in automount.c works as expected.

- Start automount outside the distributed PID space. I am not sure how
  to do this, bproc is damn good at not letting you escape :-) Maybe we
  could modify /etc/inittab on the node, signal the init process and
  have our automount process created this way?


Any opinions and suggestions are more than welcome, especially
comments on current handling of process groups in bproc (I know
nothing about it).

Thanks and Best Regards

Vaclav Hanzl


========= copy of my original message on the beowulf mailing list: ========

Subject: autofs mount on bproc node?
From: hanzl
To: beowulf at beowulf.org
Date: Wed, 08 May 2002 17:18:39 +0200

Hi,

Has any of you great gurus managed to use autofs on bproc nodes?


I am pushing hard, but at this moment it feels like hitting a concrete
wall with my head... any help would be more than welcome.

My nodes already run an NFS client, an NFS server and syslogd/klogd.
Automount seems to start OK, but when I ls an automounted directory (node
is client, head is server), ls hangs, the automount process hangs (kill -9
is needed to kill it) and there is no error message anywhere.

I have syslogd working on the node and automount is full of syslog()
calls, but in this case it says nothing; it probably hangs early, when
receiving the automount request from the kernel.

The same automount setup works when head is client and node is server.

(Is there any way to force an already compiled kernel to give out more
debug messages? E.g. by writing somewhere in /proc?)


Running daemons on bproc nodes is tricky and probably worth a
mini-howto. (Yes, I try to avoid daemons on nodes, but sometimes I
really need them.) If you know of any related documents, please let me
know.

Thanks

Vaclav

-------------------------------------------------------

My setup:

RedHat 7.2 with most rpm updates till Nov 2001, Clustermatic (March
2002 version), kernel 2.4.18-lanl.16, automount version 3.1.7 (also
tested 4.0.0).

==> /etc/auto.master <==
# /etc/auto.master
/nfs  /etc/auto.nfs  rw,intr,rsize=8192,wsize=8192

==> /etc/auto.nfs <==
# /etc/auto.nfs
*  -fstype=autofs,-Dhost=&  file:/etc/auto.sub

==> /etc/auto.sub <==
# /etc/auto.sub
*  ${host}:/&

==> /etc/beowulf/syslog.conf <==
# syslog.conf for magi nodes

# log everything on screen:
*.*                                                     /dev/console

# log everything to head (magi):
*.*                                                     @10.0.4.1

==> /etc/beowulf/exports.node <==
# exports for beowulf node (experimental)
/etc 10.0.4.1(ro)

==> /etc/beowulf/nsswitch.conf <==
passwd: bproc files
hosts: bproc

==> /etc/exports <==
# magi exports
/bin 10.0.4.0/255.255.255.0(ro)
/home noel(rw) 10.0.4.0/255.255.255.0(ro)
/lib 10.0.4.0/255.255.255.0(ro)
/sbin 10.0.4.0/255.255.255.0(ro)
/usr 10.0.4.0/255.255.255.0(ro)
/var 10.0.4.0/255.255.255.0(ro)

==> /etc/sysconfig/syslog <==
# Options to syslogd
# -m 0 disables 'MARK' messages.
# -r enables logging from remote machines
# -x disables DNS lookups on messages recieved with -r
# See syslogd(8) for more details
#SYSLOGD_OPTIONS="-m 0"
# VH: added -r for log from magi nodes to magi master:
SYSLOGD_OPTIONS="-m 0 -r"
# Options to klogd
# -2 prints all kernel oops messages twice; once for klogd to decode, and


== And my main experimental script: ==



#!/bin/bash

# beowulf node startup for magi

# New (RH7.2_Clustermatic/magi) version

echo This is /nfs/noel/home/hanzl/beowulf/startnode

########### CONFIG AREA ############

#nodenum:
N=1

#master IP:
MASTER=10.0.4.1

# directories to NFS-replicate from master (read only):

NFSDIRS='bin sbin usr'

# directories to just create on node:

CREATEDIRS='tmp nfs .autofsck var/lib/nfs/sm var/run var/lock/subsys var/nis'

# files to copy from master to node with identical pathname:

COPYFILES='/etc/auto.master /etc/auto.nfs /etc/auto.sub /etc/services /etc/rpc /etc/protocols /etc/passwd /etc/group'

# modules needed on nodes:

MODULES='sunrpc lockd nfs nfsd autofs'

####################################

echo Master IP: $MASTER, node number: $N

bpstat $N

echo Rebooting node $N...

bpctl -S $N -s reboot

while true; do
 STATUS=`bpstat $N -s`
 echo -n ' '$STATUS
 if [[ "$STATUS" == "up" ]]; then break; fi
 sleep 1
done

echo ''
echo Node $N is up!

####################################

echo "Creating any missing directories on node $N:"

for d in $NFSDIRS $CREATEDIRS; do
  echo -n ' '$d
  bpsh $N mkdir -p /$d
done

echo ''
echo ... done

####################################

echo "Inserting modules on node $N"

for m in $MODULES; do
  echo -n ' '$m
  modprobe --node $N $m
done

echo ''
echo ... done

####################################

echo "Copying config files to node $N"

echo "Files with identical pathname:"

for f in $COPYFILES; do
  echo -n ' '$f
  bpcp $f $N:$f
done

echo ''
echo ... done

echo "Files with different pathname:"

bpcp /etc/beowulf/syslog.conf $N:/etc/syslog.conf
bpcp /etc/beowulf/exports.node $N:/etc/exports
bpcp /etc/beowulf/nsswitch.conf $N:/etc/nsswitch.conf

echo ... done

####################################




####################################
echo "Hello to console"|bpsh $N dd of=/dev/console 2>/dev/null

####################################


echo "Starting basic daemons on node $N"


bpsh $N touch /var/lib/nfs/rmtab 
bpsh $N touch /var/lib/nfs/xtab  
bpsh $N touch /var/lib/nfs/etab

echo "portmap..."
bpsh $N portmap

echo "syslogd..."
bpsh $N syslogd

echo "klogd..."
bpsh $N klogd

bpsh $N logger "Starting basic rpc daemons on node $N"

echo "rpc.statd..."
# statd will run in /var/lib/nfs/statd as rpcuser and needs rw access
bpsh $N mkdir -p /var/lib/nfs/statd
bpsh $N chown rpcuser /var/lib/nfs/statd
bpsh $N chgrp rpcuser /var/lib/nfs/statd
bpsh $N rpc.statd 
bpsh $N touch /var/lock/subsys/nfslock

echo "Expected rpcinfo:" nlockmgr portmapper status
echo "  Actual rpcinfo:" `bpsh $N rpcinfo -p|awk '{print $5}'|sort|uniq`
echo "(maybe they did not start yet...)"
# 'nlockmgr' is provided by kernel module, not daemon

####################################
# (This could have been done even just after portmap, but is probably safer here)
echo "Mounting NFS directories, node $N is client, head is server"
bpsh $N logger "Mounting NFS directories, node $N is client, head is server"
for d in $NFSDIRS; do
  echo -n ' '$d
  bpsh $N mount $MASTER:/$d /$d
done
echo ''
echo ... done
####################################

echo "Expected rpcinfo:" nlockmgr portmapper status
echo "  Actual rpcinfo:" `bpsh $N rpcinfo -p|awk '{print $5}'|sort|uniq`



echo "Starting NFS server daemons on node $N"
bpsh $N logger "Starting NFS server daemons on node $N"

# We should provide: mountd nfs rquotad

RPCNFSDCOUNT=8

echo 'exportfs (not daemon)...'
bpsh $N exportfs -r

## echo "rpc.rquotad"
## bpsh $N rpc.rquotad
## # hangs (but shows in rpcinfo), avoid it

echo "rpc.mountd..."
# Special treatment needed: rpc.mountd with socket on stdin goes mad
# and therefore cannot be started by bpsh directly. Stdin redirect
# on node helps. You can either NFS-mount /usr/sbin or you can use
# new bproc, which delivers missing executables when absolute path
# is used.
bpsh $N sh -c '/usr/sbin/rpc.mountd </dev/null'

echo "rpc.nfsd (count=$RPCNFSDCOUNT)..."
bpsh $N rpc.nfsd $RPCNFSDCOUNT

bpsh $N touch /var/lock/subsys/nfs

echo "NFS server on node $N should work now"

####################################

echo "Expected rpcinfo:" mountd nfs nlockmgr portmapper status
echo "  Actual rpcinfo:" `bpsh $N rpcinfo -p|awk '{print $5}'|sort|uniq`

echo "Testing automount (head is client, node $N is server):"
echo "ls /nfs/n$N/etc:"
ls /nfs/n$N/etc

####################################

echo "Processes on node $N seen in head PID-space:"

ps aux|bpstat -P|grep ^$N'[^0-9]'

####################################

exit

echo "Starting autofs client on node $N"
bpsh $N logger "Starting autofs client on node $N"

bpsh $N /usr/sbin/automount /nfs file /etc/auto.nfs rw,intr,rsize=8192,wsize=8192
bpsh $N touch /var/lock/subsys/autofs

#bpsh $N sh -c '/usr/sbin/automount /nfs file /etc/auto.nfs rw,intr,rsize=8192,wsize=8192'

#bpsh $N sh -c '/root/autofs-4.0.0pre10/daemon/automount /nfs file /etc/auto.nfs rw,intr,rsize=8192,wsize=8192'

# bpsh $N ls /nfs/n-1/var will HANG !!! :-(


==================== END =====================



