[Beowulf] Re: Grid Engine, Parallel Environment, Scheduling, Myrinet, and MPICH

William Burke wburke999 at msn.com
Wed Mar 23 12:42:39 PST 2005


I can't get a parallel environment (PE) to work on a 50-node class II
Beowulf. The cluster has a front-end Sunfire v40 (the qmaster host) and 49
Sunfire v20s (the execution hosts) running Linux, configured to communicate
over Myrinet using MPICH-GM version 1.26.14a.

 

The N1GE environment needs to handle the following:

1.	Serial jobs that pre-process the data, with an average runtime of 15
minutes.
2.	Parallel processing jobs that take the pipelined output of step 1, with
runtimes ranging from 1 to 6 hours.
3.	Serial post-processing jobs that run concurrently.

I have set up a parallel environment called mpich-gm and a straightforward
FIFO scheduling scheme for testing. When I submit parallel jobs, they hang
in limbo in the 'qw' (queued, waiting) state and are never dispatched. I am
not sure why the scheduler does not see the jobs that I submit.

 

I used the Myrinet MPICH template located in the
$SGE_ROOT/< sge_cell >/mpi/myrinet directory to configure the PE (parallel
environment), and I copied the sge_mpirun script to the
$SGE_ROOT/< sge_cell >/bin directory. I configured a Production.q queue that
runs only parallel jobs. As a last sanity check I ran a trace on the
scheduler and submitted a simple parallel job. These are the results that I
got from the logs:

 

 

JOB RUN Window

[wems@wems examples]$ qsub -now y -pe mpich-gm 1-4 -b y hello++
Your job 277 ("hello++") has been submitted.
Waiting for immediate job to be scheduled.

Your qsub request could not be scheduled, try again later.
[wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
Your job 278 ("hello++") has been submitted.
[wems@wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
Your job 279 ("hello++") has been submitted.

 

This is the 2nd window SCHEDULER LOG

[root@wems bin]# qconf -tsm
[root@wems bin]# qconf -tsm
[root@wems bin]# cat /WEMS/grid/default/common/schedd_runlog
Wed Mar 23 06:08:55 2005|-------------START-SCHEDULER-RUN-------------
Wed Mar 23 06:08:55 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available
Wed Mar 23 06:08:55 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available
Wed Mar 23 06:08:55 2005|queues dropped because they are temporarily not available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com
Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on
Wed Mar 23 06:08:55 2005|--------------STOP-SCHEDULER-RUN-------------
Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN-------------
Wed Mar 23 06:11:37 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available
Wed Mar 23 06:11:37 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available
Wed Mar 23 06:11:37 2005|queues dropped because they are temporarily not available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com
Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on
Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN-------------
[root@wems bin]# qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
    279 0.55500 hello++    wems         qw    03/23/2005 06:11:43                                    1
[root@wems bin]#

 

BTW, the node wems10.grid.wni.com has connectivity issues, and I have not
yet removed it from the cluster queue.
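
To confirm that wems10 is the only host being dropped, the runlog can be summarized with a small filter. This is a minimal sketch: the function name dropped_instances is my own, and it assumes only that each entry quotes the instance name after the words "queue instance", as in the log above. It works whether the names are written with "@" or with " at ".

```shell
# Print each queue instance that a schedd_runlog (read from stdin)
# reports as dropped, one per line, without duplicates.
# dropped_instances is a hypothetical helper name, not an SGE command.
dropped_instances() {
    # Keep only the quoted name that follows 'queue instance'.
    # LC_ALL=C makes the sort order independent of the local locale.
    sed -n 's/.*queue instance "\([^"]*\)".*/\1/p' | LC_ALL=C sort -u
}
```

Usage, with the runlog path from the session above: dropped_instances < /WEMS/grid/default/common/schedd_runlog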

 

What causes N1GE to report "no pending jobs to perform scheduling on" in
the schedd_runlog even though there are available slots ready to take jobs?

I had no problem submitting serial jobs; only the parallel jobs behave this
way. Are there N1GE/Myrinet issues that I am not aware of? FYI, the same
binary (hello++) runs with no problems from the command line.

Since I generally submit scripts through qsub rather than binaries, I also
created a script to run the MPICH executable, but that yielded the same
result.
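
One check that may or may not be relevant is whether the queue actually lists the PE. The sketch below reads a queue configuration in the format printed by qconf -sq and looks for a PE name on its pe_list line. The function name queue_has_pe is hypothetical. It assumes the pe_list entry sits on a single line and does not interpret host-specific overrides in square brackets.

```shell
# Succeed (exit 0) if the PE named in $1 appears on the pe_list line of
# a queue configuration read from stdin, fail (exit 1) otherwise.
# queue_has_pe is a hypothetical helper name, not an SGE command.
queue_has_pe() {
    awk -v pe="$1" '
        $1 == "pe_list" {
            # Entries may be separated by whitespace or by commas.
            for (i = 2; i <= NF; i++) {
                n = split($i, parts, ",")
                for (j = 1; j <= n; j++)
                    if (parts[j] == pe) found = 1
            }
        }
        END { exit(found ? 0 : 1) }
    '
}
```

Usage: qconf -sq Production.q | queue_has_pe mpich-gm && echo "PE is attached"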

 

I have an additional question about the queue configuration parameter
"subordinate_list". How should its value be read in the output of qconf -mq
<queue_name>?

Example:

            subordinate_list     low_pri.q=5,small.q

 

Which queue has priority over the other based on the slots?
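
To make the syntax concrete, here is a minimal sketch that splits such a value into its entries. The function name parse_subordinates is my own, and it assumes only the comma-separated name[=number] form shown in the example. It does not settle which queue takes precedence.

```shell
# Split a subordinate_list value into one "queue threshold" pair per line.
# An entry without "=N" is printed with the placeholder word "none".
# parse_subordinates is a hypothetical helper name, not an SGE command.
parse_subordinates() {
    printf '%s\n' "$1" | tr ',' '\n' | while IFS='=' read -r queue threshold; do
        [ -n "$queue" ] || continue    # skip empty entries
        printf '%s %s\n' "$queue" "${threshold:-none}"
    done
}
```

Usage: parse_subordinates "low_pri.q=5,small.q" prints one line for low_pri.q with the number 5 and one line for small.q with no number.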

 

 

William Burke

Tellitec Sollutions
