[Beowulf] scheduler recommendations for a HPC cluster

Chris Dagdigian dag at sonsorol.org
Tue Oct 6 16:04:58 PDT 2009


Platform LSF is one of the best available offerings if you consider the
overall administrative burden, APIs, quality of documentation and
quality of support. Basically, in some cases the cost of the
commercial software license more than pays for itself in having a  
product that is stable, well documented and has a very low admin/ 
operational burden. Trying to save money on an open source product and  
then needing to hire additional people to keep it from falling over is  
a mistake I've seen at more than one site.

That said, I'm personally a Grid Engine zealot these days and use/
deploy it often. It has open source and commercial flavors, an amazing
support community and a high level of product acceptance & market share
in the life sciences, which is where I do most of my work.

In the interest of full disclosure I do Grid Engine consulting &  
training so I'm not totally unbiased here.

When it comes to PBS variants I'd avoid the pure open source versions
of PBS/Torque - I don't think I've ever been in an OpenPBS or Torque
shop that has not altered the source code or otherwise dug deeply into
the product. For people considering the PBS route I always recommend
checking in with the PBS Pro people first.

Just my $.02

-Chris



On Oct 6, 2009, at 9:22 PM, Rahul Nabar wrote:

> Any strong / weak recommendations for / against schedulers? For a long
> time we have worked happily with a Torque + Maui system. It isn't
> perfect but works (and is free!). But rarely does a chance present
> itself to go for something "newer and better" on an in-production
> system since people hate changes and outages. This time as we shop for
> a new cluster it presents me the opportunity to change if something
> better exists.
>
> Any comments? What are other users using out there?  Any horror
> stories? Or any super good finds?
>
> I shy away from LSF etc. since those cost a lot of money, especially as
> they, and similar systems, are mostly licensed per server per year, so
> the costs do add up. I have been a user on LSF systems for a long
> time and I think it is an awesome scheduler but have never been at the
> admin end of LSF.
>
> One area where the Torque+Maui option is not the best is that it is
> not monolithic. Oftentimes it is hard to know which component to blame
> for a problem or, more relevantly, which config file to use to fix a
> problem: Torque or Maui. On the other hand, we can't get rid of Maui
> since Fairshare policies etc. are important to us and those seem to be
> in the Maui domain. (All our jobs are MPI jobs, in case that is
> relevant. We haven't been doing checkpointing yet.)
>
> Of course, there is MOAB these days, but I am not sure if that is
> worth the money since I have not used it.
>
> I appreciate any comments or words of wisdom you guys might have!
>
> -- 
> Rahul