[Beowulf] Do these SGE features exist in Torque?

Mon May 12 08:19:58 PDT 2008

Hiho,

On 12.05.2008 at 15:14, Prentice Bisbal wrote:

>>> At a previous job, I installed SGE for our cluster. At my current job
>>> Torque is the queuing system of choice. I'm very familiar with SGE,
>>> but only have a cursory knowledge of Torque (installed it for
>>> evaluation, and that's it). We're about to purchase a new cluster.
>>> I'd have to make a good argument for using SGE over Torque. I was
>>> wondering if the following SGE features exist in Torque:
>>>
>>> 1. Interactive shells managed by the queuing system
>>> 2. Counting licenses in use (done using a contributed shell script
>>>    in SGE)
>>> 3. Separation of roles between submit hosts, execution hosts, and
>>>    administration hosts
>>> 4. Certificate-based security
>>>
>>> Are there any notable features available in Torque that aren't
>>> available in SGE?
>>
>> What you can find in Torque but not in SGE: requesting a mixture of
>> nodes, i.e. one heavy node with much memory (or big I/O options) and
>> 5 nodes with less memory or less disk performance for a parallel job.
>
> Huh? Can you elaborate? My initial thought is "why would you need
> this?", but I think I see where you're going with this...

For some types of applications, only the "master" of a parallel job
collects all the information and accesses the disk to store the
results. Therefore this machine, and only this one, needs better I/O
capabilities than the slave nodes.

Getting an arbitrary combination of resources is still an RFE (request
for enhancement) in SGE. E.g., if one job needs 1 host with big I/O, 2
with huge memory and 3 "standard" nodes, in Torque you could request:

-l nodes=1:big_io+2:mem+3:standard

(Although this syntax has its own pitfalls: in my tests, -l nodes=4:ppn=1
could still allocate 2 or more slots on the same node.)
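To make the structure of such a request explicit, here is a minimal Python sketch of how a value like the one above decomposes into node groups. It is not Torque's actual parser: the names `NodeGroup` and `parse_nodes_spec` are hypothetical, and it assumes every `+`-separated group starts with a node count, treating plain `:` tokens as node properties and `key=value` tokens (such as `ppn=1`) as per-node attributes.

```python
from dataclasses import dataclass, field


@dataclass
class NodeGroup:
    """One '+'-separated group of a nodes= request (hypothetical representation)."""
    count: int
    properties: list[str] = field(default_factory=list)
    attributes: dict[str, str] = field(default_factory=dict)


def parse_nodes_spec(spec: str) -> list[NodeGroup]:
    """Split a value like 'nodes=1:big_io+2:mem+3:standard' into node groups.

    Simplified sketch: assumes each group starts with a node count
    (hostnames instead of counts are not handled). Plain tokens are
    node properties; 'key=value' tokens (e.g. ppn=1) are attributes.
    """
    if spec.startswith("nodes="):
        spec = spec[len("nodes="):]
    groups = []
    for group in spec.split("+"):
        count_str, *tokens = group.split(":")
        node_group = NodeGroup(count=int(count_str))
        for token in tokens:
            if "=" in token:
                key, value = token.split("=", 1)
                node_group.attributes[key] = value
            else:
                node_group.properties.append(token)
        groups.append(node_group)
    return groups


if __name__ == "__main__":
    for g in parse_nodes_spec("nodes=1:big_io+2:mem+3:standard"):
        print(g.count, g.properties, g.attributes)
```

The sketch only describes what is requested (here 6 nodes in three differently equipped groups); which hosts actually satisfy each group, and how slots are packed onto them, is left to the scheduler.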

As Craig correctly pointed out, it can be set up in SGE, but there
might be combinations where this gets convoluted. Thankfully I never
needed this type of node allocation, but I just wanted to point
out that this feature exists in Torque. If you don't need it for your
type of jobs, ignore it ;-)

BTW @Craig: it should be possible to request -masterq
qbigmem*@bigmem* in 6.0, which would shorten the line.

>>
>> OTOH, if you have parallel jobs:
>> http://www.beowulf.org/archive/2007-September/019269.html
>
> Thanks for the link. From my understanding of SGE, you can get tight
> integration with just about any MPI implementation. Is that true?

At least with all the ones I'm aware of. Nowadays I would go for Open
MPI, which should work out of the box with both queuing systems. If
you need Linda or HP-MPI, SGE seems to be the only option for a Tight
Integration (of these two, that is; I'm not aware of the features of
LSF and others).

>> The conceptual difference between them: in Torque you submit a job
>> into a queue, while in SGE you request resources and SGE will select
>> an appropriate queue for you.
>
> You'll have to elaborate on this, too. From my knowledge of SGE, you
> had to specify the correct queue, too, or it went into the default
> queue.

There is no such thing as a default queue in SGE. Even all.q, defined
at installation time, is just a queue without any special features;
you can edit or remove it if you don't want it in your cluster.

Suppose you have queues with different limits, e.g. one with 1 h
wallclock and one with unlimited wallclock, plus two queues limited to
8 GB and 16 GB of memory. A job requesting 2 h of wallclock time will
always end up in the queue without a time limit, while a job requesting
30 minutes might end up in either of the two queues, whichever has a
free slot. The same holds for memory requests: a job requesting 4 GB
might end up in either of the two memory-limited queues, while a 12 GB
request can only run in the 16 GB queue. You don't have to specify the
queue, as SGE will select one that fulfills your requested resources.
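The eligibility part of this selection can be sketched in Python. This is a simplified model, not SGE's scheduler: the queue names (`short`, `long`, `mem8`, `mem16`) and resource keys (`wallclock_h`, `mem_gb`) are hypothetical stand-ins for the example above, a queue with no limit for a resource is treated as unlimited, and free slots, load and priorities are ignored.

```python
import math
from dataclasses import dataclass


@dataclass
class Queue:
    name: str
    limits: dict[str, float]  # resource name -> maximum; missing key = unlimited


def eligible_queues(queues: list[Queue], request: dict[str, float]) -> list[str]:
    """Return names of queues whose limits cover every requested resource."""
    return [
        q.name
        for q in queues
        if all(amount <= q.limits.get(res, math.inf) for res, amount in request.items())
    ]


# The two scenarios from the text, modelled separately (hypothetical names).
wallclock_queues = [
    Queue("short", {"wallclock_h": 1}),
    Queue("long", {}),  # no wallclock limit
]
memory_queues = [
    Queue("mem8", {"mem_gb": 8}),
    Queue("mem16", {"mem_gb": 16}),
]

if __name__ == "__main__":
    print(eligible_queues(wallclock_queues, {"wallclock_h": 2}))    # ['long']
    print(eligible_queues(wallclock_queues, {"wallclock_h": 0.5}))  # ['short', 'long']
    print(eligible_queues(memory_queues, {"mem_gb": 4}))            # ['mem8', 'mem16']
    print(eligible_queues(memory_queues, {"mem_gb": 12}))           # ['mem16']
```

The sketch stops at the eligibility check; among the eligible queues, the real scheduler still has to pick one with a free slot.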

--- Reuti

> In SGE you can specify resources such as mem >= 32 GB for a node, or
> arch=AMD64. You can't do this with Torque? Seems like a very basic
> queuing system feature.