Linux cpusets and HPC (was Re: [Beowulf] Can one Infiniband net support MPI and a parallel filesystem?)

Chris Samuel csamuel at
Thu Aug 14 01:03:16 PDT 2008

----- "Paul Jackson" <pj at> wrote:

Hi Paul,

> Chris wrote:
> > The 2.6 cpuset support in Torque came out of a long
> Would you have any pointers to some more details of what you've
> done here?

Sure - mostly it was discussed on the torquedev list after
the initial discussion at SC'07; Garrick started the thread.

His announcement of the initial implementation, along with
notes on the differences from the plan is here:

There is a Wiki page on it too, but that isn't up to date as
it doesn't mention that we stopped using the per-vnode/core
cpusets due to the OpenMPI issues.

> I'm the maintainer, and one of the authors, of Linux 2.6
> cpusets, and would like to do what I can with cpusets to make life
> easier (or at least no more painful) for cluster and MPI folks.

Wonderful!  First of all thanks so much for the code.

The only major issue we've come across is not due to cpusets
themselves but to the way that things like OpenMPI tend to
work: the MPI launcher starts a single process per node, and
that process then forks off all the necessary child processes.
This means it's not easy to lock MPI tasks to cores via this
method, and it's also not trivial for the MPI program to work
out which cores it may try to bind itself to via
sched_setaffinity().

> My background comes more from the "big honkin NUMA iron"
> running a Single System Image on 100's or 1000's of CPUs
> (SGI Irix/Origin and later Linux/Altix), which was the
> "country of origin" for cpusets, so my interest (and
> ignorance) in asking this question is more to gain
> an understanding of how cpusets have been adapted to
> clusters, as I understand less well the needs of clusters,
> and what if anything cpusets might do here to be of more use.

The main purpose we're using them for is as a quick and
easy way to catch users who don't know better doing
things like running an OpenMP code as a single-CPU job
and overloading a node (and causing chaos for other
users) when it discovers 8 cores.

Single CPU jobs get the benefit of being locked to
a single core, and even MPI jobs get some benefit
in that they can only be migrated between cores
they've been allocated.

> Totally totally trivial nit -- you wrote:
> I prefer in my setups to have that mount command be:
>   mount -t cpuset cpuset /dev/cpuset
> so that the mount shows up in the output of the mount(8) command
> with 'cpuset' in the mount 'device' field, not 'none'.

Thanks for that, much appreciated!

Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
