[Beowulf] Re: Grid Engine, Parallel Environment, Scheduling, Myrinet, and MPICH

Andrew Wang andrewxwang at yahoo.com.tw
Fri Mar 25 06:51:25 PST 2005


Please send your question to the SGE mailing list:

http://gridengine.sunsource.net/project/gridengine/maillist.html

The "users" list is what you want.

BTW, you should try commands like "qstat -f" or
"qhost" to find out the status of the machines.

Also, do serial jobs work?
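
A minimal serial test might be:

    $ qsub -b y /bin/hostname   # submit /bin/hostname as a binary job
    $ qstat                     # it should leave 'qw' and run within a scheduling interval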

Andrew.



 --- Message from William Burke <wburke999 at msn.com>:
> I can't get a PE to work on a 50-node class II
> Beowulf. It has a front-end Sunfire v40 (qmaster
> host) and 49 Sunfire v20s (execution hosts) running
> Linux, configured to communicate over Myrinet using
> MPICH-GM version 1.26.14a.
> 
>  
> 
> These are the requirements the N1GE environment has
> to handle:
> 
> 1. Serial jobs that pre-process the data - average
> runtime 15 minutes.
> 2. Parallel processing jobs into which that output
> is pipelined - runtime 1-6 hours.
> 3. Post-processing serial jobs that run
> concurrently.
> 
> I have set up a Parallel Environment called mpich-gm
> and a straightforward FIFO scheduling scheme for
> testing. When I submit parallel jobs they stay stuck
> in the 'qw' (pending) state. I am not sure why the
> scheduler does not see the jobs that I submit.
> 
>  
> 
> I used the myrinet mpich template located in the
> $SGE_ROOT/<sge_cell>/mpi/myrinet directory to
> configure the PE (parallel environment), and I
> copied the sge_mpirun script to the
> $SGE_ROOT/<sge_cell>/bin directory.  I configured a
> Production.q queue that runs only parallel jobs.  As
> a last sanity check I ran a trace on the scheduler,
> submitted a simple parallel job, and these are the
> results that I got from the logs:
> 
>  
> 
>  
> 
> JOB RUN Window
> 
> [wems at wems examples]$ qsub -now y -pe mpich-gm 1-4 -b y hello++
> 
> Your job 277 ("hello++") has been submitted.
> 
> Waiting for immediate job to be scheduled.
> 
> Your qsub request could not be scheduled, try again later.
> 
> [wems at wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
> 
> Your job 278 ("hello++") has been submitted.
> 
> [wems at wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
> 
> Your job 279 ("hello++") has been submitted.
> 
>  
> 
> This is the 2nd window (SCHEDULER LOG):
> 
> [root at wems bin]# qconf -tsm
> 
> [root at wems bin]# qconf -tsm
> 
> [root at wems bin]# cat /WEMS/grid/default/common/schedd_runlog
> 
> Wed Mar 23 06:08:55 2005|-------------START-SCHEDULER-RUN-------------
> 
> Wed Mar 23 06:08:55 2005|queue instance "all.q at wems10.grid.wni.com" dropped because it is temporarily not available
> 
> Wed Mar 23 06:08:55 2005|queue instance "Production.q at wems10.grid.wni.com" dropped because it is temporarily not available
> 
> Wed Mar 23 06:08:55 2005|queues dropped because they are temporarily not available: all.q at wems10.grid.wni.com Production.q at wems10.grid.wni.com
> 
> Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on
> 
> Wed Mar 23 06:08:55 2005|--------------STOP-SCHEDULER-RUN-------------
> 
> Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN-------------
> 
> Wed Mar 23 06:11:37 2005|queue instance "all.q at wems10.grid.wni.com" dropped because it is temporarily not available
> 
> Wed Mar 23 06:11:37 2005|queue instance "Production.q at wems10.grid.wni.com" dropped because it is temporarily not available
> 
> Wed Mar 23 06:11:37 2005|queues dropped because they are temporarily not available: all.q at wems10.grid.wni.com Production.q at wems10.grid.wni.com
> 
> Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on
> 
> Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN-------------
> 
> [root at wems bin]# qstat
> 
> job-ID  prior    name      user   state  submit/start at      queue  slots  ja-task-ID
> ---------------------------------------------------------------------------------------
>     279 0.55500  hello++   wems   qw     03/23/2005 06:11:43          1
> 
> [root at wems bin]#
> 
>  
> 
> BTW, that node wems10.grid.wni.com has connectivity
> issues and I have not removed it from the cluster
> queue.
> 
>  
> 
> What causes N1GE to report "no pending jobs to
> perform scheduling on" in the schedd_runlog even
> though there are available slots ready to take
> jobs?
> 
> I had no problem submitting serial jobs; only the
> parallel jobs behave this way. Are there N1GE -
> Myrinet issues that I am not aware of?  FYI, the
> same binary (hello++) runs with no problems from
> the command line.
> 
> Since I generally submit scripts through qsub
> instead of binaries, I created a script to run the
> MPICH executable, but that yielded the same result.
> 
>  
> 
> I have an additional question regarding setting the
> queue.conf parameter called "subordinate_list". How
> should it be interpreted in the output of qconf -mq
> <queue_name>?
> 
> Example:
> 
>             subordinate_list    low_pri.q=5,small.q
> 
>  
> 
> Which queue has priority over the other based on the
> slots?
> 
>  
> 
>  
> 
> William Burke
> 
> Tellitec Sollutions
> 



