[Beowulf] Again about NUMA (numactl and taskset)
Hakon.Bugge at scali.com
Fri Jul 18 07:46:59 PDT 2008
At 08:39 27.06.2008, Patrick Geoffray wrote:
>Håkon Bugge wrote:
>>This is information we're using to optimize how
>>point-to-point communication is implemented. The
>>code-base involved is fairly complicated and I
>>do not expect resource management systems to cope with it.
>Why not ? It's its job to know the resources it
>has to manage. The resource manager has more
>information than you, it does not have to detect
>at runtime for each job, and it can manage cores
>allocation across jobs. You cannot expect the
>granularity of the allocation to stay at the
>node level with the core count increasing.
This raises two questions: a) Which job
schedulers are able to optimize placement on
cores, thereby _improving_ application
performance? b) Which job schedulers are able to
deduce which cores share an L3 cache and are situated on the same socket?
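On Linux, question b) can in principle be answered from sysfs, which exposes per-CPU socket and cache-sharing information. A minimal sketch (assuming the standard Linux sysfs layout; the paths and availability of cache entries vary by kernel and platform):

```python
import glob
import os


def cpu_topology():
    """Map each CPU to its socket id and the CPUs sharing its
    last-level cache, as reported by Linux sysfs. Entries that
    sysfs does not expose are skipped or left as None."""
    info = {}
    for cpu_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
        cpu = int(os.path.basename(cpu_dir)[3:])
        try:
            with open(os.path.join(cpu_dir, "topology/physical_package_id")) as f:
                socket = int(f.read())
        except OSError:
            continue  # topology info not exposed for this CPU
        # The highest cache index is the last-level cache (e.g. L3);
        # its shared_cpu_list names the cores that share it.
        caches = sorted(glob.glob(os.path.join(cpu_dir, "cache/index*")))
        shared = None
        if caches:
            try:
                with open(os.path.join(caches[-1], "shared_cpu_list")) as f:
                    shared = f.read().strip()
            except OSError:
                pass
        info[cpu] = (socket, shared)
    return info


if __name__ == "__main__":
    for cpu, (socket, shared) in cpu_topology().items():
        print(f"cpu{cpu}: socket={socket} llc_shared_with={shared}")
```

A scheduler with access to this mapping could group ranks that communicate heavily onto cores sharing a last-level cache.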
... and a clarification. Systems using Scali MPI
Connect _can_ have finer granularity than the
node level; the job scheduler just must not
oversubscribe. Assignment of cores to processes
is done _dynamically_ by Scali MPI Connect.
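For illustration only (Scali MPI Connect's internal mechanism is not described here): on Linux, a launcher that assigns cores to processes can enforce the assignment through the same sched_setaffinity interface that taskset uses. A minimal sketch:

```python
import os


def pin_to_core(core: int) -> None:
    """Pin the calling process to a single core, as a launcher or
    resource manager might do when handing a core to an MPI rank.
    Uses the Linux sched_setaffinity syscall (pid 0 = this process)."""
    os.sched_setaffinity(0, {core})


# Pin this process to core 0 and confirm the mask took effect.
pin_to_core(0)
print(os.sched_getaffinity(0))
```

The same call, issued per rank with a core chosen from the topology, is all that is needed to keep the granularity of allocation below the node level without oversubscribing.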
>If the MPI implementation does the spawning, it
>should definitively have support to enforce core
>affinity (most do AFAIK). However, core affinity
>should be dictated by the scheduler. Heck, the
>MPI implementation should not do the spawning in the first place.
>Historically, resource managers have been pretty
>dumb. These days, there is enough competition in this domain to expect better.
I am fine with the schedulers dictating it, but not if the performance is hurt.