[Beowulf] after update sgeexecd not starting correctly on reboot
David Mathog mathog at caltech.edu
Tue Nov 25 14:40:38 PST 2008
This is an odd one, and I hope one of you has seen it and fixed it, because the only way I have been able to trigger the bug is through a reboot.

I updated one node from Mandriva 2007.1 to 2008.1. Those are both 2.6.x kernels, and are, as you might guess, about a year apart. Both use the exact same SGE distribution, which is NFS mounted on /usr/SGE6. On a reboot of the newer system, /etc/rc.d/init.d/sgeexecd, which is the last thing to start in runlevel 3 (except for S99local, which doesn't do anything except "touch /var/lock/subsys/local"), fails. First it spews a bunch of lines which look like a script did "set", and as a side effect this pushes all the other text lines off the console. Then it emits

   can't determine path to Grid Engine binaries

and exits without starting sge_execd. On the older system the exact same script starts up with none of this drama, leaving sge_execd running.

However, once I log on as root at the console on the newer system, it happily starts up with:

   /etc/rc.d/init.d/sgeexecd start

There are no SGE variables defined in .bashrc etc.

The init script has these prerequisites, as on the older system:

   # Provides: sgeexecd
   # Required-Start: $network $remote_fs

Ring any bells? I think maybe the NFS mounting is different, so that the remote_fs prerequisite isn't really satisfied, even though the associated script has run. The sgeexecd script does include a test:

   while [ ! -d "$SGE_ROOT" -a $count -le 120 ]; do
      count=`expr $count + 1`
      sleep 1
   done

but since SGE_ROOT is the mount point, that directory exists whether or not the NFS mount has completed, so the loop exits immediately and never actually waits for the mount. Maybe I'll change that to $SGE_ROOT/bin and see if it helps.

Thanks,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
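For reference, the change proposed above (waiting on a directory that only exists once the NFS export is mounted, rather than on the mount point itself) might look roughly like the following. This is only a sketch and has not been tried on the affected node: the function name wait_for_dir and its optional timeout argument are made up for illustration and are not part of the stock sgeexecd script, and it assumes $SGE_ROOT/bin is present on the NFS export but absent from the bare local mount point.

```shell
#!/bin/sh
# wait_for_dir DIR [MAX_SECONDS]
# Hypothetical helper (not in the stock sgeexecd script): poll once per
# second until DIR exists or MAX_SECONDS (default 120) polls have passed.
# Returns 0 if DIR exists at the end, non-zero otherwise.
wait_for_dir() {
    dir=$1
    max=${2:-120}
    count=0
    while [ ! -d "$dir" ] && [ "$count" -lt "$max" ]; do
        count=$((count + 1))
        sleep 1
    done
    [ -d "$dir" ]
}

# Intended use inside the init script (assumption: bin/ lives only on the
# NFS export, so it appears only after the mount has completed):
#   wait_for_dir "$SGE_ROOT/bin" 120 || echo "SGE_ROOT/bin not available"
```

Unlike the original loop, this returns a status, so the caller can tell a timeout apart from a successful mount instead of falling through to the "can't determine path" error.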
