[Beowulf] Re: Grid Engine, Parallel Environment, Scheduling, Myrinet, and MPICH

Andrew Wang andrewxwang at yahoo.com.tw
Fri Mar 25 06:51:25 PST 2005


Please send your question to the SGE mailing list:

http://gridengine.sunsource.net/project/gridengine/maillist.html

The "users" list is what you want.

BTW, you should try commands like "qstat -f", or
"qhost" to find out the status of the machines.
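To see at a glance which queue instances the scheduler keeps dropping, the schedd_runlog quoted below can also be summarized with a short script. This is only a minimal Python sketch: it assumes each log entry sits on one line in the form "timestamp|message" (the quoted copy is re-wrapped by the mailer), and the function name and sample text are illustrative, not part of Grid Engine.

```python
import re
from collections import Counter

# Matches entries such as:
#   Wed Mar 23 06:08:55 2005|queue instance "all.q@wems10.grid.wni.com" dropped because ...
# Also accepts " at " in place of "@", which is how list archives obscure the sign.
DROPPED = re.compile(
    r'\|queue instance "(?P<queue>[^"@ ]+)(?:@| at )(?P<host>[^"]+)" '
    r'dropped because (?P<reason>.+)$'
)

def summarize_dropped(log_text):
    """Count how often each (queue, host, reason) is reported as dropped."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DROPPED.search(line.strip())
        if match:
            key = (match.group("queue"), match.group("host"), match.group("reason"))
            counts[key] += 1
    return counts

# Sample entries copied from the quoted log, one entry per line.
SAMPLE = """\
Wed Mar 23 06:08:55 2005|-------------START-SCHEDULER-RUN-------------
Wed Mar 23 06:08:55 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available
Wed Mar 23 06:08:55 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available
Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on
Wed Mar 23 06:08:55 2005|--------------STOP-SCHEDULER-RUN-------------
Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN-------------
Wed Mar 23 06:11:37 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available
Wed Mar 23 06:11:37 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available
Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on
Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN-------------
"""

if __name__ == "__main__":
    for (queue, host, reason), n in sorted(summarize_dropped(SAMPLE).items()):
        print(f"{queue}@{host}: dropped {n}x ({reason})")
```

On the quoted log this reports only the two queue instances on wems10, which matches the node already known to have connectivity problems, so the dropped queues alone probably do not explain the pending parallel jobs.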

Also, do serial jobs work?

Andrew.



 --- William Burke <wburke999 at msn.com> wrote:
> I can't get PE to work on a 50-node class II Beowulf. It has a front-end
> Sunfire v40 (qmaster host) and 49 Sunfire v20s (execution hosts) running
> Linux, configured to communicate data over Myrinet using MPICH-GM version
> 1.26.14a.
> 
>  
> 
> These are the requirements the N1GE environment has to handle:
>
> 1. Serial jobs for pre-processing the data; average runtime 15 minutes.
> 2. Output is pipelined into parallel processing jobs; runtime ranges from
>    1 to 6 hours.
> 3. Post-processing serial jobs run concurrently.
> 
> I have set up a Parallel Environment called mpich-gm and a straightforward
> FIFO scheduling scheme for testing. When I submit parallel jobs they hang
> in limbo in the 'qw' (queued, waiting) state. I am not sure why the
> scheduler does not see the jobs that I submit.
> 
>  
> 
> I used the Myrinet MPICH template located in the
> $SGE_ROOT/< sge_cell >/mpi/myrinet directory to configure the PE (parallel
> environment), and I copied the sge_mpirun script to the
> $SGE_ROOT/< sge_cell >/bin directory. I configured a Production.q queue
> that runs only parallel jobs. As a last sanity check I ran a trace on the
> scheduler, submitted a simple parallel job, and these are the results that
> I got from the logs:
> 
>  
> 
>  
> 
> JOB RUN Window
>
> [wems@wems examples]$ qsub -now y -pe mpich-gm 1-4 -b y hello++
> Your job 277 ("hello++") has been submitted.
> Waiting for immediate job to be scheduled.
>
> Your qsub request could not be scheduled, try again later.
> [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
> Your job 278 ("hello++") has been submitted.
> [wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
> Your job 279 ("hello++") has been submitted.
> 
>  
> 
> This is the 2nd window: SCHEDULER LOG
>
> [root@wems bin]# qconf -tsm
> [root@wems bin]# qconf -tsm
> [root@wems bin]# cat /WEMS/grid/default/common/schedd_runlog
> Wed Mar 23 06:08:55 2005|-------------START-SCHEDULER-RUN-------------
> Wed Mar 23 06:08:55 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:08:55 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:08:55 2005|queues dropped because they are temporarily not available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com
> Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on
> Wed Mar 23 06:08:55 2005|--------------STOP-SCHEDULER-RUN-------------
> Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN-------------
> Wed Mar 23 06:11:37 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:11:37 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:11:37 2005|queues dropped because they are temporarily not available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com
> Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on
> Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN-------------
> 
> [root@wems bin]# qstat
> job-ID  prior   name       user         state submit/start at     queue      slots ja-task-ID
> ---------------------------------------------------------------------------------------------
>     279 0.55500 hello++    wems         qw    03/23/2005 06:11:43            1
> [root@wems bin]#
> 
>  
> 
> BTW that node wems10.grid.wni.com has connectivity
> issues and I have not
> removed it from the cluster queue.  
> 
>  
> 
> What causes N1GE to report "no pending jobs to perform scheduling on" in
> the schedd_runlog even though there are slots available to take jobs?
> 
> I had no problem submitting serial jobs; only the parallel jobs behave
> this way. Are there N1GE/Myrinet issues that I am not aware of? FYI, the
> same binary (hello++) runs with no problems from the command line.
> 
> Since I generally run scripts from qsub instead of binaries, I created a
> script to run the mpich executable, but that yielded the same result.
> 
>  
> 
> I have an additional question regarding a queue.conf parameter called
> "subordinate_list". How should its value, as shown by qconf -mq
> <queue_name>, be read?
>
> Example:
>
>     subordinate_list    low_pri.q=5,small.q
>
> Which queue has priority over the other, based on the slots?
> 
>  
> 
>  
> 
> William Burke
> 
> Tellitec Sollutions
> 
