[Beowulf]

Wed Mar 23 15:25:30 PST 2005

Hi,

I'd suggest moving this over to the SGE users list at:
http://gridengine.sunsource.net/servlets/ProjectMailingListList

But anyway, let's sort things out:

Quoting William Burke <wburke999 at msn.com>:

> I can't get PE to work on a 50 node class II Beowulf. It has a front-end
> Sunfire v40 (qmaster host) and 49 Sunfire v20s (execution hosts) running
> Linux configured to communicate data over Myrinet using MPICH-GM version
> 1.26.14a. 

Although there is a special Myrinet directory, you can also try to use the 
files in the mpi directory instead.

> These are the requirements of the N1GE environment to handle: 
> 
> 1.	Serial type jobs for pre-processing the data - average runtime 15
> minutes. 
> 2.	Output is pipelined into parallel processing jobs - range of runtime
> 1- 6 hours. 
> 3.	Concurrently running is post-processing serial jobs. 
> 
> I have setup a Parallel Environment called mpich-gm and a straight-forward
> FIFO scheduling schema for testing. When I submit parallel jobs they hang in
> limbo in a 'qw' state pending submission. I am not sure why the scheduler
> does not see jobs that I submit.
> 
>  
> 
> I used the myrinet mpich template located $SGE_ROOT/<sge_cell>/mpi/myrinet
> directory to configure the pe (parallel environment) plus I copied the
> sge_mpirun script to the $SGE_ROOT/<sge_cell>/bin directory.  I configured
> a Production.q queue that runs only parallel jobs. As a last sanity check I
> ran a trace on the scheduler, submitted a simple parallel job, and this is
> the results that I got from the logs:

Can you please give more details of your queue and PE setup (the output of
"qconf -sq Production.q" and "qconf -sp mpich-gm")?

> JOB RUN Window
> 
> [wems at wems examples]$ qsub -now y -pe mpich-gm 1-4 -b y hello++
> 
> Your job 277 ("hello++") has been submitted.
> 
> Waiting for immediate job to be scheduled.
> 
>  
> 
> Your qsub request could not be scheduled, try again later.
> 
> [wems at wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
> 
> Your job 278 ("hello++") has been submitted.
> 
> [wems at wems examples]$ qsub -pe mpich-gm 1-4 -b y hello++
> 
> Your job 279 ("hello++") has been submitted.

You can't start a parallel job this way, as no mpirun is used. When you submit 
the script you mentioned, do you get the same behavior (and does it call 
mpirun -np $NSLOTS ...)?
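
For illustration, a job script along the following lines is what I have in 
mind. It is only a sketch under some assumptions: that the start_proc_args 
script of your PE writes the host list to $TMPDIR/machines, that the mpirun 
found in the PATH is the MPICH-GM one, and that hello++ is in the submit 
directory. The job name hello_mpi, the launch helper and the MPIRUN override 
are made up for this example.

```shell
#!/bin/sh
#$ -N hello_mpi
#$ -pe mpich-gm 1-4
#$ -cwd
#$ -S /bin/sh

# SGE sets NSLOTS to the number of slots actually granted (1 to 4 here).
# ASSUMPTION: the PE start script has written the host list to
# $TMPDIR/machines; adjust the path if your setup differs.
launch() {
    # MPIRUN is a hypothetical override, handy for a dry run with "echo".
    ${MPIRUN:-mpirun} -np "$NSLOTS" -machinefile "$TMPDIR/machines" "$@"
}

# Only start the program when running under SGE (NSLOTS is set there).
if [ -n "${NSLOTS:-}" ]; then
    launch ./hello++
fi
```

Saved as e.g. hello_mpi.sh, it would be submitted with "qsub hello_mpi.sh", 
without -b y and without repeating the -pe request on the command line.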

> This is the 2nd window SCHEDULER LOG
> 
> [root at wems bin]# qconf -tsm
> 
> [root at wems bin]# qconf -tsm
> 
> [root at wems bin]# cat /WEMS/grid/default/common/schedd_runlog
> 
> Wed Mar 23 06:08:55 2005|-------------START-SCHEDULER-RUN-------------
> Wed Mar 23 06:08:55 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:08:55 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:08:55 2005|queues dropped because they are temporarily not available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com
> Wed Mar 23 06:08:55 2005|no pending jobs to perform scheduling on
> Wed Mar 23 06:08:55 2005|--------------STOP-SCHEDULER-RUN-------------
> 
> Wed Mar 23 06:11:37 2005|-------------START-SCHEDULER-RUN-------------
> Wed Mar 23 06:11:37 2005|queue instance "all.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:11:37 2005|queue instance "Production.q@wems10.grid.wni.com" dropped because it is temporarily not available
> Wed Mar 23 06:11:37 2005|queues dropped because they are temporarily not available: all.q@wems10.grid.wni.com Production.q@wems10.grid.wni.com
> Wed Mar 23 06:11:37 2005|no pending jobs to perform scheduling on
> Wed Mar 23 06:11:37 2005|--------------STOP-SCHEDULER-RUN-------------
> 
> [root at wems bin]# qstat
> 
> job-ID  prior   name       user         state submit/start at     queue      slots ja-task-ID
> ---------------------------------------------------------------------------------------------
>     279 0.55500 hello++    wems         qw    03/23/2005 06:11:43                1
> 
> [root at wems bin]#

Do you have an admin account for SGE? I'd prefer not to do anything in SGE as 
root.

> BTW that node wems10.grid.wni.com has connectivity issues and I have not
> removed it from the cluster queue.  
> 
>  
> 
> What causes this type of problem in N1GE to return "no pending jobs to
> perform scheduling on" in the schedd_runlog even though there are available
> slots ready to take jobs?  
> 
> I had no problem submitting serial jobs, only the parallel jobs resulted as
> such. Are there N1GE - Myrinet issue that I am not aware of?  FYI the same
> binary (hello++) runs with no problems from the command line.

If you just start hello++ directly, I think it will not run in parallel.

Not really an issue, but you have to make a small change to mpirun.ch_gm.pl so 
that all processes of a job stay in the same process group and get killed 
correctly if the job is aborted:

http://gridengine.sunsource.net/howto/mpich-integration.html

> Since I generally run scripts from qsub instead of binaries I created a
> script to run the mpich executable but that yield the same result.
> 
>  
> 
> I have an additional question regarding setting a queue.conf parameter
> called "subordinate_list". How is it read from the result of qconf -mq
> <queue_name>?
> 
> Example 
> 
>             i.e., subordinate_list     low_pri.q=5,small.q.

The queue "low_pri.q" will be suspended when 5 or more slots of "<queue_name>" 
are filled. The queue "small.q" will be suspended when all slots of 
"<queue_name>" are filled.
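
To make the rule concrete, here is a small shell sketch that applies it to a 
subordinate_list value. It is not part of SGE: the function name 
suspended_queues and its arguments (the list, the number of filled slots and 
the total number of slots of "<queue_name>") are invented for this example.

```shell
# Print the subordinate queues that would be suspended.
#   $1 = subordinate_list value, e.g. "low_pri.q=5,small.q"
#   $2 = slots currently filled in the superordinate queue
#   $3 = total slots of the superordinate queue
suspended_queues() {
    list=$1; used=$2; total=$3
    old_ifs=$IFS; IFS=,
    for entry in $list; do
        name=${entry%%=*}
        case $entry in
            *=*) threshold=${entry#*=} ;;   # explicit threshold given
            *)   threshold=$total ;;        # none given: all slots filled
        esac
        if [ "$used" -ge "$threshold" ]; then
            echo "$name"
        fi
    done
    IFS=$old_ifs
}
```

With 8 slots in total, "suspended_queues 'low_pri.q=5,small.q' 5 8" prints 
only low_pri.q, while with all 8 slots filled it prints both queues.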

Cheers - Reuti