[Beowulf] Implementation thru Sun Grid Engine

Sangamesh B forum.san at gmail.com
Sun Feb 24 02:53:02 PST 2008


Hi All,

  The following requirements need to be implemented through
Sun Grid Engine:

1.    Q1 - One queue with all the cores (e.g., 2 nodes, 8 cores)
2.    Q2 & Q3 - 2 more queues under the large queue (6 cores & 2 cores)
3.    user1 & user2 are allowed to submit jobs to Q2
4.    user3 & user4 are allowed to submit jobs to Q3 only
5.    user1 is also allowed to submit jobs to Q1 (burstable)
6.    Can an application be restricted to run only through Q2?
7.    Can an application be allowed to run through Q3 & Q1?

The cluster has 2 systems (1 master + 1 node). Each has two dual-core
processors, so there are 4 cores per system and 8 cores in total.

Below I explain what I've done so far and what I haven't.

Created queues:

q1: Hosts=2 Master+Node slots=4
q2: Hosts=2 Master+Node slots=3
q3: Hosts=1 Node slots=1
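
As a rough check of the layout above, the total slots per queue can be
tallied against the requirements. This is only a minimal sketch: it assumes
the "slots" value applies per host, and the dictionary and function names are
made up for illustration.

```python
# Queue layout as listed above: (number of hosts, slots per host).
# Assumption: "slots" is a per-host value.
queues = {"q1": (2, 4), "q2": (2, 3), "q3": (1, 1)}

# Core counts asked for in requirements 1 and 2.
required = {"q1": 8, "q2": 6, "q3": 2}


def total_slots(hosts, slots_per_host):
    """Total slots a queue offers across all of its hosts."""
    return hosts * slots_per_host


totals = {name: total_slots(*layout) for name, layout in queues.items()}

for name in queues:
    status = "ok" if totals[name] >= required[name] else "short"
    print(f"{name}: {totals[name]} slots, {required[name]} required -> {status}")
```

Under that assumption q1 and q2 match the requirement, while q3 offers 1 slot
instead of the 2 cores asked for, which may be worth checking.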

Users:
user1
user2
user3
user4

Usersets:
userset12
userset34

Also,
user1 & user2 belong to userset12
user3 & user4 belong to userset34
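
These usersets are later attached to queues and PEs as allow lists
(user_lists) and deny lists (xuser_lists). The intended effect can be modelled
as below. This is a sketch of how Grid Engine is expected to behave, not its
actual code: it assumes that an empty allow list admits everyone and that the
deny list wins when a user appears in both. The function name may_use is
hypothetical.

```python
# Usersets as defined above.
usersets = {
    "userset12": {"user1", "user2"},
    "userset34": {"user3", "user4"},
}


def may_use(user, user_lists, xuser_lists):
    """Return True if `user` passes the allow/deny lists of a queue or PE."""
    denied = set().union(*(usersets[name] for name in xuser_lists))
    if user in denied:
        return False  # the deny list takes precedence
    if not user_lists:
        return True  # no allow list: everyone who is not denied may use it
    allowed = set().union(*(usersets[name] for name in user_lists))
    return user in allowed


# Intended setup for q2: allow userset12, deny userset34.
q2 = {"user_lists": ["userset12"], "xuser_lists": ["userset34"]}
for user in ["user1", "user2", "user3", "user4"]:
    print(user, may_use(user, **q2))
```

With these lists user1 and user2 pass the check for q2, and user3 and user4 do
not.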


1.    Q1 - One queue with all the cores (e.g., 2 nodes, 8 cores)

    Created queue q1, with 2 nodes, slots=4

2.    Q2 & Q3 - 2 more queues under the large queue (6 cores & 2 cores)

    Mentioned above.
    How can I make these two queues fall under q1, the large queue?

    What are subordinate queues?

    If we make q2 and q3 subordinate to q1 (q2 with 6 cores & q3 with 2
cores), does that meet our requirement?

    If not, is it possible to do it some other way?


3.    user1 & user2 are allowed to submit jobs to Q2

  I've given userset12 access to q2, so user1 and user2 can submit jobs to
q2, and I've set userset34 as the xuserset.

Now if user1 or user2 submits a parallel job with 6 MPI processes, will it
take 4 cores from the master and 2 cores from the node?
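
For reference, a host-by-host fill strategy can be sketched as follows. This
is a simplified model, not Grid Engine's scheduler: it assumes the PE used
with q2 has allocation_rule $fill_up like the mpiq34 PE shown further down,
and that q2 has 3 free slots on each host. The host labels "master" and "node"
and the function name fill_up are illustrative.

```python
def fill_up(requested, free_slots):
    """Allocate `requested` slots host by host, filling each host first.

    `free_slots` is an ordered list of (host, free slot count) pairs.
    Returns a dict of host -> slots, or None if the request does not fit.
    """
    allocation = {}
    remaining = requested
    for host, free in free_slots:
        if remaining == 0:
            break
        take = min(free, remaining)
        if take > 0:
            allocation[host] = take
            remaining -= take
    return allocation if remaining == 0 else None


# Assumption: q2 offers 3 free slots on each of the two hosts.
q2_free = [("master", 3), ("node", 3)]
print(fill_up(6, q2_free))
print(fill_up(7, q2_free))
```

In this model a 6-process job can only be placed as 3 + 3, never 4 + 2,
because q2 has just 3 slots per host, and a 7-process job does not fit at all.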

I tested it. The 6-process job did not get executed, but a 3-process job
did.

The error is:
...
....

parallel environment:  mpiq2 range: 6
scheduling info:            has no permission for queue "all.q@compute-0-0.local"
                            cannot run in queue instance "q1@compute-0-0.local" because it is not contained in its hard queue list (-q)
                            has no permission for queue "q3@compute-0-0.local"
                            has no permission for queue "san.q@compute-0-0.local"
                            has no permission for queue "all.q@locuzcluster.local"
                            cannot run in queue instance "q1@locuzcluster.local" because it is not contained in its hard queue list (-q)
                            has no permission for queue "san.q@locuzcluster.local"
                            cannot run in PE "mpiq2" because it only offers 0 slots

This means the job does not run when there are more than 3 MPI processes.

Can anyone tell me what went wrong here?


With serial jobs it works. If user1/user2 submits 6 serial jobs, 3 run on
the master and 3 on the compute node. If a 7th job is submitted, it waits in
the queue and starts when a slot becomes free.

So the problem with parallel jobs still has to be resolved.

4.    user3 & user4 are allowed to submit jobs to Q3 only
         I've given access to userset34 and set xuserset=userset12.

     If user3 or user4 submits a 2-process job, the job gets submitted but
doesn't execute. The error is:

parallel environment:  mpiq34 range: 2
scheduling info:            has no permission for host "locuzcluster.local"
                            has no permission for host "compute-0-0.local"
                            cannot run in PE "mpiq34" because it only offers 0 slots


The PE config is as follows:

$ qconf -sp mpiq34
pe_name           mpiq34
slots             2
user_lists        userset34
xuser_lists       userset12
start_proc_args   /share/apps/MPICH2/startmpi.sh  -catch_rsh  $pe_hostfile
stop_proc_args    /share/apps/MPICH2/stopmpi.sh
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task TRUE
urgency_slots     min
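
One sanity check on this config is to compare the PE slot count with what q3
can offer. The sketch below parses the output above into a dictionary. It
assumes q3 has 1 host with 1 slot, as in the queue list earlier, and the
helper name parse_qconf is hypothetical.

```python
# Output of "qconf -sp mpiq34", copied from above.
pe_text = """\
pe_name           mpiq34
slots             2
user_lists        userset34
xuser_lists       userset12
start_proc_args   /share/apps/MPICH2/startmpi.sh  -catch_rsh  $pe_hostfile
stop_proc_args    /share/apps/MPICH2/stopmpi.sh
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task TRUE
urgency_slots     min
"""


def parse_qconf(text):
    """Turn 'key   value' lines into a dict; values stay as strings."""
    config = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.strip().partition(" ")
        config[key] = value.strip()
    return config


pe = parse_qconf(pe_text)
q3_slots = 1 * 1  # hosts * slots per host (assumed from the queue list)

print("PE slots:", pe["slots"], "q3 slots:", q3_slots)
if int(pe["slots"]) > q3_slots:
    print("a job asking for all PE slots cannot fit into q3 alone")
```

This only covers the slot counts. The "has no permission for host" lines
point at an access list rather than at slots, so the user lists on the hosts,
the queue and the PE are also worth checking.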

I don't understand how to resolve this issue. There might be something wrong
in the settings.


5.    user1 is also allowed to submit jobs to Q1 (burstable)

   For this I've added user1 to the owner list and set userset34 as
xuserset.

6.    Can an application be restricted to run only through Q2?
7.    Can an application be allowed to run through Q3 & Q1?

  These two still have to be implemented.


Can anyone help me resolve the issues mentioned above?


Thanks,
Sangamesh