[Beowulf] Re: python2.4 error when loose MPICH2 TI with Grid Engine


Reuti reuti at staff.uni-marburg.de
Sun Mar 2 01:45:06 PST 2008


Hi,

On 22.02.2008 at 09:23, Sangamesh B wrote:

> Dear Reuti & members of beowulf,
>
> I need to execute a parallel job thru grid engine.
>
> MPICH2 is installed with Process Manager:mpd.
>
> Added a parallel environment MPICH2 into SGE:
>
> $ qconf -sp MPICH2
> pe_name           MPICH2
> slots             999
> user_lists        NONE
> xuser_lists       NONE
> start_proc_args   /share/apps/MPICH2/startmpi.sh -catch_rsh $pe_hostfile
> stop_proc_args    /share/apps/MPICH2/stopmpi.sh
> allocation_rule   $pe_slots
> control_slaves    FALSE
> job_is_first_task TRUE
> urgency_slots     min
>
>
> Added this PE to the default queue: all.q.
>
> mpdboot is done. mpd's are running on two nodes.
>
> The script for submitting this job thru sge  is:
>
> $ cat subsamplempi.sh
> #!/bin/bash
>
> #$ -S /bin/bash
>
> #$ -cwd
>
> #$ -N Samplejob
>
> #$ -q all.q
>
> #$ -pe MPICH2 4
>
> #$ -e ERR_$JOB_NAME.$JOB_ID
>
> #$ -o OUT_$JOB_NAME.$JOB_ID
>
> date
>
> hostname
>
> /opt/MPI_LIBS/MPICH2-GNU/bin/mpirun -np $NSLOTS -machinefile $TMP_DIR/machines ./samplempi
>
> echo "Executed"
>
> exit 0
>
>
> The job gets submitted but does not execute. The error and
> output files contain:
>
> $ cat ERR_Samplejob.192
> /usr/bin/env: python2.4: No such file or directory
>
> $ cat OUT_Samplejob.192
> -catch_rsh /opt/gridengine/default/spool/compute-0-0/active_jobs/192.1/pe_hostfile
> compute-0-0
> compute-0-0
> compute-0-0
> compute-0-0
> Fri Feb 22 12:57:18 IST 2008
> compute-0-0.local
> Executed
>
> So the problem seems to be with python2.4.
>
> $ which python2.4
> /opt/rocks/bin/python2.4
>
> I googled this error. Then created a symbolic link:
>
> # ln -sf /opt/rocks/bin/python2.4 /bin/python2.4
>
> The same error still occurs after this.
>
> I suspect the real problem is different, i.e. Grid Engine might not
> be getting the link to the running mpd.
>
> The procedure I followed to configure the PE might also be wrong.
>
> So I hope you can clear up my doubts and help me resolve
> this error.
>
> 1. Is the PE configuration of MPICH2 + grid engine right?

If you want to integrate MPICH2 with MPD, it's similar to a PVM setup.
The daemons must be started in start_proc_args on every node, with a
dedicated port number per job. You don't say what your startmpi.sh is
doing.
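
To illustrate the two ingredients mentioned above (one daemon per node, one port per job), here is a rough sketch of helpers a start_proc_args script might use. The function names job_port and unique_hosts are made up for this example, and the port formula (base 20000, range 5000) is only an assumption; pick whatever range is free at your site. The only thing taken from SGE itself is that each line of the pe_hostfile begins with the hostname. The actual mpd startup on each host is left out on purpose, since it depends on your installation.

```shell
#!/bin/bash
# Hypothetical helpers for a start_proc_args script; names are illustrative.

# Derive a port from the SGE job id so that concurrent jobs do not collide.
# The base (20000) and range (5000) are assumptions, not SGE defaults.
job_port() {
    local job_id=$1
    echo $(( job_id % 5000 + 20000 ))
}

# Print each host named in a pe_hostfile once, in order of first appearance.
# Each pe_hostfile line starts with the hostname, followed by the slot count.
unique_hosts() {
    awk '!seen[$1]++ { print $1 }' "$1"
}
```

Inside the start script one would then call job_port "$JOB_ID" once and loop over the output of unique_hosts on the file passed in as $pe_hostfile, starting one daemon per host on that port.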

> 2. Without tight integration, is there a way to run an MPICH2 (mpd)
> based job using gridengine?

Yes.

> 3. Between smpd daemon-based and daemonless MPICH2 tight integration,
> which one is better?

It depends: if you have just one mpirun per job which will run for days,
I would go for the daemonless startup. But if you issue many mpirun
calls in your jobscript which each run for only seconds, I would go
for the daemon-based startup, as the mpirun will be distributed to
the slaves faster.

> 4. Can we do mvapich2 tight integration with SGE? Any differences  
> with process managers wrt MVAPICH2?

Maybe, if the startup is similar to standard MPICH2.

-- Reuti


> Thanks & Best Regards,
> Sangamesh B
