[Beowulf] SGE + LAM

Andrew Wang andrewxwang at yahoo.com.tw
Mon Aug 16 19:58:37 PDT 2004


Did you try the SGE mailing list? There are several
people using SGE+LAM on Linux.

Andrew.

 --- "C.L. Lai [ALAN]" <clai33 at uwo.ca> wrote:
> 
> I have been trying to do an SGE6+LAM7 integration,
> but no luck so far.
> 
> After a long conversation on the LAM mailing list, I still don't know
> whether the problem is in my settings, LAM, SGE, or the SGE+LAM
> combination, but some people pointed out that the error suggests the
> rsh/rshd shipped with SGE is not working properly.
> 
> I am not getting any useful SGE logs. Here is the log generated by the
> sge-lam script:
> 
> This is 'sge-lam start'
> 
> SGE-LAM DEBUG: LAMHOME = /usr
> SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
> SGE-LAM DEBUG: PATH = /tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin
> SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
> SGE-LAM DEBUG: ARGV = ""
> SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
> SGE-LAM DEBUG: func=start
> SGE-LAM DEBUG: LAMBOOT ARGS: -nn -ssi boot rsh -ssi boot_rsh_agent /home/compute/sge/lam/sge-lam qrsh-remote -c /home/compute/sge/lam/sge-lam-conf.lamd -v -d /tmp/537.1.all.q/lamhostfile /tmp/537.1.all.q/lamhostfile
> SGE-LAM DEBUG: LAMHOSTSLIST: rational.math.uwo.ca cpu=2
> n0<24778> ssi:boot: Opening
> n0<24778> ssi:boot: looking for module named rsh
> n0<24778> ssi:boot: opening module rsh
> n0<24778> ssi:boot: initializing module rsh
> n0<24778> ssi:boot:rsh: module initializing
> n0<24778> ssi:boot:rsh:agent: /home/compute/sge/lam/sge-lam qrsh-remote
> n0<24778> ssi:boot:rsh:username: <same>
> n0<24778> ssi:boot:rsh:verbose: 1000
> n0<24778> ssi:boot:rsh:algorithm: linear
> n0<24778> ssi:boot:rsh:priority: 10
> n0<24778> ssi:boot: Selected boot module rsh
> n0<24778> ssi:boot:base: looking for boot schema in following directories:
> n0<24778> ssi:boot:base:   <current directory>
> n0<24778> ssi:boot:base:   $TROLLIUSHOME/etc
> n0<24778> ssi:boot:base:   $LAMHOME/etc
> n0<24778> ssi:boot:base:   /etc/lam
> n0<24778> ssi:boot:base: looking for boot schema file:
> n0<24778> ssi:boot:base:   /tmp/537.1.all.q/lamhostfile
> n0<24778> ssi:boot:base: found boot schema: /tmp/537.1.all.q/lamhostfile
> n0<24778> ssi:boot:rsh: found the following hosts:
> n0<24778> ssi:boot:rsh:   n0 rational.math.uwo.ca (cpu=2)
> n0<24778> ssi:boot:rsh: resolved hosts:
> n0<24778> ssi:boot:rsh:   n0 rational.math.uwo.ca --> 129.100.75.80
> n0<24778> ssi:boot:rsh: starting RTE procs
> n0<24778> ssi:boot:base:linear: starting
> n0<24778> ssi:boot:base:server: opening server TCP socket
> n0<24778> ssi:boot:base:server: opened port 35804
> n0<24778> ssi:boot:base:linear: booting n0 (rational.math.uwo.ca)
> n0<24778> ssi:boot:rsh: starting lamd on (rational.math.uwo.ca)
> n0<24778> ssi:boot:rsh: starting on n0 (rational.math.uwo.ca): hboot -t -c /home/compute/sge/lam/sge-lam-conf.lamd -d -v -sessionsuffix sge-537-0 -I -H 129.100.75.80 -P 35804 -n 0 -o 0
> n0<24778> ssi:boot:rsh: launching locally
> n0<24778> ssi:boot:rsh: successfully launched on n0 (rational.math.uwo.ca)
> n0<24778> ssi:boot:base:server: expecting connection from finite list
> n0<24778> ssi:boot:base:server: got connection from 0.0.0.0
> -----------------------------------------------------------------------------
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.
>
> As far as LAM could tell, the remote process started properly, but
> then never called back.  Possible reasons that this may happen:
>
>         - There are network filters between the lamboot agent host and
>           the remote host such that communication on random TCP ports
>           is blocked
>         - Network routing from the remote host to the local host isn't
>           properly configured (this is uncommon)
>
> You can check these things by watching the output from "lamboot -d".
>
> 1. On the command line for hboot, there are two important parameters:
>    one is the IP address of where the lamboot agent was invoked, the
>    other is the port number that the lamboot agent is expecting the
>    newly-booted process to call back on (this will be a random
>    integer).
>
> 2. Manually login to the remote machine and try to telnet to the port
>    indicated on the hboot command line.  For example,
>        telnet <ipnumber> <portnumber>
>    If all goes well, you should get a "Connection refused" error.  If
>    you get any other kind of error, it could indicate either of the
>    two conditions above.  Consult with your system/network
>    administrator.
> -----------------------------------------------------------------------------
> n0<24778> ssi:boot:base:server: failed to connect to remote lamd!
> n0<24778> ssi:boot:base:server: closing server socket
> n0<24778> ssi:boot:base:linear: aborted!
> -----------------------------------------------------------------------------
> lamboot encountered some error (see above) during the boot process,
> and will now attempt to kill all nodes that it was previously able to
> boot (if any).
>
> Please wait for LAM to finish; if you interrupt this process, you may
> have LAM daemons still running on remote nodes.
> -----------------------------------------------------------------------------
> lamboot did NOT complete successfully
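
The two checks the LAM message describes (read the -H address and -P port
off the hboot command line, then try to connect to them from the remote
node) can be scripted instead of done by hand with telnet. Below is a
minimal Perl sketch. It is not part of sge-lam; the sub names
callback_endpoint and probe_port are made up for illustration, and it
only uses the core IO::Socket::INET module.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Pull the callback address (-H) and port (-P) out of an hboot/lamd
# command line such as the one printed by "lamboot -d".
# (Hypothetical helper, not part of sge-lam.)
sub callback_endpoint {
    my ($cmdline) = @_;
    my ($host) = $cmdline =~ /(?:^|\s)-H\s+(\S+)/;
    my ($port) = $cmdline =~ /(?:^|\s)-P\s+(\d+)/;
    return ($host, $port);
}

# Attempt a plain TCP connect, the same check as
# "telnet <ipnumber> <portnumber>". Returns a short description
# of the outcome instead of dying.
sub probe_port {
    my ($host, $port, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout // 5,
    );
    if ($sock) {
        close($sock);
        return "connected";
    }
    # IO::Socket::INET leaves the reason for the failure in $@
    return "failed: $@";
}
```

Run on the compute node, something like
print probe_port(callback_endpoint($hboot_line)), "\n" would report the
result. Going by the LAM text, "Connection refused" is the healthy answer
once lamboot has exited; a timeout or a routing error points at filtering
or routing. In your log the only node is booted locally, so I would not
expect the network to be the culprit here.
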
> 
> 
> 
> This is 'sge-lam qrsh-local'
> 
> SGE-LAM DEBUG: LAMHOME = /usr
> SGE-LAM DEBUG: SGE_ROOT = /home/compute/sge
> SGE-LAM DEBUG: PATH = /tmp/537.1.all.q:/usr/local/bin:/usr/ucb:/bin:/usr/bin::/home/compute/sge/bin/lx26-amd64:/usr/bin:/home/compute/sge/bin/lx26-amd64:/usr/bin
> SGE-LAM DEBUG: qrsh = /home/compute/sge/bin/lx26-amd64/qrsh
> SGE-LAM DEBUG: ARGV = "/usr/bin/lamd" "-H" "129.100.75.80" "-P" "35804" "-n" "0" "-o" "0" "-d" "-sessionsuffix" "sge-537-0"
> SGE-LAM DEBUG: sgelamconf = /home/compute/sge/lam/sge-lam-conf.lamd
> SGE-LAM DEBUG: func=qrsh-local
> SGE-LAM DEBUG: QRSH LOCAL CONFIG: -inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80 -P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
> SGE-LAM DEBUG: Exec qrsh-local: /home/compute/sge/bin/lx26-amd64/qrsh -inherit -nostdin -V rational.math.uwo.ca /usr/bin/lamd -H 129.100.75.80 -P 35804 -n 0 -o 0 -d -sessionsuffix sge-537-0
> rcmd: socket: Permission denied
> 
> 
> 
> The last line above is the one people think is
> qrsh/rsh/rshd related.
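
For what it's worth, rcmd(3) asks for a reserved port (below 1024), and
only a privileged process may bind one, so "rcmd: socket: Permission
denied" usually means the rsh client being run is not setuid root. With
"qrsh -inherit" that would most likely be the rsh shipped with SGE rather
than the system one. A quick check, as a sketch only: the sub name
is_setuid_root is hypothetical, and the utilbin path in the comment is my
assumption about where your install keeps its rsh, so adjust it.

```perl
use strict;
use warnings;

# Return 1 if $path exists, is owned by root (uid 0) and has the
# setuid bit set; return 0 otherwise, including when stat fails.
# (Hypothetical helper, not part of SGE or sge-lam.)
sub is_setuid_root {
    my ($path) = @_;
    my @st = stat($path) or return 0;   # empty list means stat failed
    my ($mode, $uid) = @st[2, 4];
    return ($uid == 0 && ($mode & 04000)) ? 1 : 0;
}

# Example call; the path below is an ASSUMPTION, not taken from the log:
#   my $rsh = "$ENV{SGE_ROOT}/utilbin/lx26-amd64/rsh";
#   print "$rsh: ", (is_setuid_root($rsh) ? "setuid root" : "NOT setuid root"), "\n";
```

If it reports "NOT setuid root", also check whether the filesystem
holding SGE_ROOT is mounted with nosuid on the compute node, since that
would have the same effect even when the mode bits look right.
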
> 
> 
> 
> %qconf -sp lam
> pe_name           lam
> slots             100
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /home/compute/sge/lam/sge-lam start
> stop_proc_args    /home/compute/sge/lam/sge-lam stop
> allocation_rule   $fill_up
> control_slaves    TRUE
> job_is_first_task FALSE
> urgency_slots     min
> 
> 
> Thanks,
> 
=== message truncated ===
> #!/usr/bin/perl
> 
> ### INSTALL DIRECTIONS:
> #
> #  1. Install this PERL executable, sge-lam inside the LAM bin dir.
> #     Make sure it is executable.
> #  2. Modify the following variables: LAMHOME below to fit your site setup.
> #
> 
> $LAMHOME="/usr";
> 
> #  3. Create an SGE PE that can be used to submit lam jobs. The following
> #     is an example assuming the scripts exist in /usr/local/lam/bin.
> #     You should replace the queue_list and slots with your site specific
> #     values or set it to "all" to use all the queues.
> #
> #        % qconf -sp lammpi 
> #        pe_name lammpi
> #        queue_list all
> #        slots 6
> #        user_lists NONE
> #        xuser_lists NONE
> #        start_proc_args /usr/local/lam/bin/sge-lam start
> #        stop_proc_args /usr/local/lam/bin/sge-lam stop
> #        allocation_rule $fill_up
> #        control_slaves TRUE
> #        job_is_first_task FALSE
> #
> #    NOTE: It is probably easiest to use the qmon GUI to create the PE.
> #
> #   4. Add a new LAM node process schema into the $LAMHOME/etc area
> #      named sge-lam-conf.lamd. This should be a single line that
> #      adds the "sge-lam qrsh-local" prefix to the lamd startup.
> #
> #       % cat /usr/local/lam/etc/sge-lam-conf.lamd
> #       /usr/local/lam/bin/sge-lam qrsh-local /usr/local/lam/bin/lamd
> #         $inet_topo $debug $session_prefix $session_suffix
> #
> #### Submitting SGE JOBS
> #
> #   Once this is setup users can submit jobs as normal and should not need to
> #   lamboot on their own. Users need only call mpirun for their MPI programs.
> #   Here is an example job:
> #
> #        % cat lamjob.csh
> #        #$ -cwd
> #        set path=(/usr/local/lam/bin $path)
> #        echo "Starting my LAM MPI job"
> #        mpirun C conn-60
> #        echo "LAM MPI job done"
> #
> #
> #
> #### Comments/Issues email: christopher.duncan at xxxxxxx
> #
> # END INSTALL
> 
> 
> $verbose=1;
> #$debug=0;
> $debug=1;
> 
> # close STDIN to avoid stdio race conditions and tty issues
> close(STDIN);
> 
> if( $debug eq 1){
> 	open(SGEDEBUG,"> /tmp/sgedebug.$ENV{JOB_ID}.$$");
> 	select(SGEDEBUG); $|=1;
> 	open(STDERR,">> /tmp/sgedebug.$ENV{JOB_ID}.$$");
> }
> 
> # set output for stderr and stdout to be unbuffered
> select(STDERR); $|=1;
> select(STDOUT); $|=1;
> 
> $lamboot="$LAMHOME/bin/lamboot";
> $lamhalt="$LAMHOME/bin/lamhalt";
> #$sgelamconf="${SGE_ROOT}/lam/sge-lam-conf.lamd";
> 
> # read in the args to figure out our task
> $func=shift @ARGV;
> 
> $SGE_ROOT="$ENV{SGE_ROOT}";
> $sgelamconf="$SGE_ROOT/lam/sge-lam-conf.lamd";
> 
> 
> $arch=`${SGE_ROOT}/util/arch`;
> chomp($arch);
> $qrsh="${SGE_ROOT}/bin/${arch}/qrsh";
> 
> # add LAM and SGE to path
> $ENV{'PATH'}.=":${SGE_ROOT}/bin/${arch}";
> $ENV{'PATH'}.=":${LAMHOME}/bin";
> 
> #debug_print("TMPDIR = $ENV{TMPDIR}");
> debug_print("LAMHOME = $LAMHOME");
> debug_print("SGE_ROOT = $SGE_ROOT");
> debug_print("PATH = $ENV{PATH}");
> debug_print("qrsh = $qrsh");
> debug_print("ARGV = \"".join("\" \"",@ARGV)."\"");
> debug_print("sgelamconf = $sgelamconf");
> 
> if("$func" eq "start"){
> 	debug_print("func=start");
> 	print "Starting SGE + LAM Integration\n";
> 	print "\t using tight integration scheme\n";
> 	start_proc_args();
> }elsif("$func" eq "stop"){
> 	debug_print("func=stop");
> 	print "Stopping SGE + LAM Integration\n";
> 	stop_proc_args();
> }elsif("$func" eq "qrsh-remote"){
> 	debug_print("func=qrsh-remote");
>         qrsh_remote();
> }elsif("$func" eq "qrsh-local"){
> 	debug_print("func=qrsh-local");
>         qrsh_local();
> }else{
> 	print STDERR "\nusage: $0 {start|stop}\n\n";	
> 	exit(-1);
> }
> 
> 
> sub start_proc_args()
> {
> 
>   # we currently place the LAM host file in the TMPDIR that SGE uses.
>   # if we place it elsewhere we need to clean it up
>   $lamhostsfile="$ENV{TMPDIR}/lamhostfile";
> 
>   # flags and options for lamboot (-x, -s and -np may be useful in some envs)
>   @lambootargs=("-nn","-ssi","boot","rsh","-ssi","boot_rsh_agent","$SGE_ROOT/lam/sge-lam qrsh-remote","-c","$sgelamconf");
>   if($verbose){ push(@lambootargs,"-v"); }
>   if($debug){ push(@lambootargs,"-d"); }
>   push(@lambootargs,"$lamhostsfile");
>   debug_print("LAMBOOT ARGS: @lambootargs $lamhostsfile");
> 
>   ### Need to convert the SGE hostfile to a LAM hostfile format
>   # open and read the PE hostfile
>   #system("cp $pe_hostfile /tmp");
> 
>   open(SGEHOSTFILE,"< $ENV{PE_HOSTFILE}");
>   # convert to LAM bhost file format
>   @lamhostslist=();
>   while(<SGEHOSTFILE>){
> 	($host,$ncpu,$junk)=split(/\s+/);
> 	push( @lamhostslist,"$host cpu=$ncpu");
>   }
>   close(SGEHOSTFILE);
> 
>   debug_print("LAMHOSTSLIST: @lamhostslist");
>   # create the new lam bhost file
>   open(LAMHOSTFILE,"> $lamhostsfile");
>   print LAMHOSTFILE join("\n",@lamhostslist);
>   print LAMHOSTFILE "\n";
>   close(LAMHOSTFILE);
> 
> 
>   if($debug){ close(SGEDEBUG); }
>   debug_print("Exec Lamboot: $lamboot @lambootargs");
>   exec($lamboot,@lambootargs);
> }
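
The hostfile conversion in start_proc_args can be tried on its own,
without SGE, which helps rule it out as the source of trouble. A small
sketch, assuming each PE_HOSTFILE line starts with "host slots" followed
by fields that are ignored (that matches the split in the script);
pe_to_bhost is a name I made up, and unlike the script it skips blank or
short lines.

```perl
use strict;
use warnings;

# Convert SGE PE hostfile lines ("host slots queue ...") into
# LAM boot schema lines ("host cpu=slots").
# (Hypothetical helper mirroring the loop in start_proc_args.)
sub pe_to_bhost {
    my @out;
    for my $line (@_) {
        # split ' ' ignores leading whitespace and the trailing newline
        my ($host, $ncpu) = split ' ', $line;
        next unless defined $host && defined $ncpu;   # skip blank/short lines
        push @out, "$host cpu=$ncpu";
    }
    return @out;
}
```

Feeding it a line that begins "rational.math.uwo.ca 2" gives
"rational.math.uwo.ca cpu=2", the same as the LAMHOSTSLIST line in your
debug output, so that part of the script looks fine.
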
> 
> 
> sub stop_proc_args(){
> 
>   if($verbose){ push(@lamhaltargs,"-v"); }
>   if($debug){ push(@lamhaltargs,"-d"); }
> 
> #  if($debug){ close(SGEDEBUG); }
>   debug_print("Exec Lamhalt: $lamhalt @lamhaltargs");
>   exec($lamhalt,@lamhaltargs);
> }
> 
> 
> 
=== message truncated ===
