[Beowulf] Grid Engine multi-core thread binding enhancement -pre-alpha release
raysonlogin at gmail.com
Tue Jul 12 13:12:14 PDT 2011
On Mon, Jul 11, 2011 at 11:39 PM, Mark Hahn <hahn at mcmaster.ca> wrote:
> since this isn't an SGE list, I don't want to pursue an off-topic too far,
I think a lot of this will apply to non-SGE batch schedulers -- in
fact Torque will support hwloc in a future release.
And all mature batch systems (e.g. LSF, SGE, SLURM) have had some form of
CPU-set support for many years, but the feature matters more now:
as more cores are added per socket, the interactions between the
different hardware layers have a growing impact on performance.
> but out of curiosity, does this make the scheduler topology aware?
> that is, not just topo-aware binding, but topo-aware resource allocation?
> you know, avoid unnecessary resource contention among the threads belonging
> to multiple jobs that happen to be on the same node.
You can tell SGE (now: Grid Scheduler) how you want to allocate
hardware resources, but different hardware architectures & program
behaviors can introduce interactions that lead to very different results.
For example, a few years ago while I was still working for a large
UNIX system vendor, I found that a few SPEC OMP benchmarks run faster
when the threads are closer to each other (even when sharing the same
core by running in SMT mode), while most benchmarks benefit from more
L2/L3 caches & memory bandwidth (I'm talking about the same thread
count for both cases).
But it is hard, even for a compiler developer, to choose the optimal
thread allocation -- even with high-level array access pattern
information & memory bandwidth models available at compilation time.
Batch systems have far less information than the compiler does.
While we can profile systems on the fly by PAPI, I doubt we will go
that route in the near future.
So, that means we need the job submitter to tell us what he wants.
In SGE/OGS we have "qsub -binding striding:<amount>:<step-size>",
which means you will need to benchmark the code, see how it
interacts with the hardware, and decide whether it runs better with
more L2/L3 cache & memory bandwidth (meaning step-size >= 2); or
"qsub -binding linear", which binds the job to consecutive cores.
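To make the striding semantics concrete, here is a small sketch (my
own illustration, not SGE code) of which logical core IDs a
striding:<amount>:<step-size> request maps to, assuming cores are
numbered sequentially starting from 0:

```shell
# Hypothetical helper: list the cores a striding binding would use.
# amount = number of cores requested, step = stride between them.
amount=4
step=2
cores=$(seq 0 "$step" $(( (amount - 1) * step )) | paste -s -d, -)
echo "$cores"   # -> 0,2,4,6

# The corresponding (illustrative) submission would be something like:
#   qsub -binding striding:4:2 job.sh
# whereas "qsub -binding linear:4" would use cores 0,1,2,3 instead.
```

With step-size 2 on a typical SMT machine, each thread gets a physical
core (and its cache) to itself, which is exactly the trade-off
described above.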
> large-memory processes
> not getting bound to a single memory node. packing both small and
> large-memory processes within a node. etc?
For memory nodes, a call to numactl should be able to handle most use-cases.
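As a sketch of what that numactl call might look like (the binary name
./my_large_mem_app is hypothetical; the flags are standard numactl
options), a job script fragment could wrap the application like this:

```shell
# Pin both CPU placement and memory allocation to NUMA node 0,
# so a large-memory process does not spill onto a remote node:
numactl --cpunodebind=0 --membind=0 ./my_large_mem_app

# Or, if the working set exceeds one node, interleave pages
# across all nodes to even out the bandwidth:
numactl --interleave=all ./my_large_mem_app
```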
> thanks, mark hahn.