[Beowulf] NUMA info request
mark.kosmowski at gmail.com
Wed Mar 26 02:33:08 PDT 2008
On Tue, Mar 25, 2008 at 12:40 PM, <kyron at neuralbs.com> wrote:
> > On Tue, Mar 25, 2008 at 12:17 AM, Eric Thibodeau <kyron at neuralbs.com>
> > wrote:
> >> Mark Hahn wrote:
> >> >> NUMA is an acronym meaning Non Uniform Memory Access. This is a
> >> >> hardware constraint and is not a "performance" switch you turn on.
> >> >> Under the Linux
> >> >
> >> > I don't agree. NUMA is indeed a description of hardware. I'm not
> >> > sure what you meant by "constraint" - NUMA is not some kind of
> >> > shortcoming.
> >> Mark is right, my choice of words is misleading. By constraint I meant
> >> that you have to be conscious of what ends up where (that was the point
> >> of the link I added in my e-mail ;P )
> >> >> kernel there is an option that tells the kernel to be aware of
> >> >> that hardware fact and to try to place a task's memory
> >> >> allocations close to the processor the task will be running on
> >> >> (processor affinity; check out taskset in recent util-linux
> >> >> implementations, ie: 2.13+).
> >> > the kernel has had various forms of NUMA and socket affinity for a
> >> > long time,
> >> > and I suspect most any distro will install a kernel which has the
> >> > appropriate support (surely any x86_64 kernel would have NUMA
> >> > support).
> >> My point of view on distro kernels is that they are to be scrutinized
> >> unless they are specifically meant to be used as computation nodes (ie:
> >> don't expect CONFIG_HZ=100 to be set on "typical" distros).
> >> Also, NUMA is only applicable to Opteron architecture (internal MMU
> >> with HyperTransport), not the Intel flavor of multi-core CPUs (external
> >> MMU, which can be a single bus or any memory access scheme as dictated
> >> by the motherboard manufacturer).
> >> >
> >> > I usually use numactl rather than taskset. I'm not sure of the
> >> > history of those tools. as far as I can tell, taskset only addresses
> >> > numactl --cpubind,
> >> > though they obviously approach things differently. if you're going
> >> > to use taskset, you'll want to set cpu affinity to multiple cpus
> >> > (those local to a socket, or 'node' in numactl terms.)
> >> >
> >> >> In your specific case, you would have 4Gigs per CPU and would want
> >> >> to make sure each task (assuming one per CPU) stays on the same CPU
> >> >> all the time and would want to make sure each task fits within the
> >> >> "local" 4Gig.
> >> >
> >> > "numactl --localalloc".
> >> >
> >> > but you should first verify that your machines actually do have the
> >> > 8GB split across both nodes. it's not that uncommon to see an
> >> > inexperienced assembler fill up one node before going onto the next,
> >> > and there have even been some boards which provided no memory to the
> >> > second node.
> >> Mark (Hahn) is right (again !), I ASSumed the tech would load the
> >> memory banks appropriately, don't make that mistake ;) And numactl is
> >> indeed more appropriate in this case (thanks Mr. Hahn ;) ). Note that
> >> the kernel (configured with NUMA) _will_ attempt to allocate the memory
> >> to "local nodes" before offloading to memory "abroad".
> >> Eric
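One way to act on the advice above (verify how the installed memory is split across nodes) is to read the per-node meminfo files that a NUMA-enabled Linux kernel exposes under sysfs. The following is only a rough sketch: the function names are made up for illustration, and it assumes the usual `/sys/devices/system/node/nodeN/meminfo` layout with a `MemTotal: ... kB` line per node. `numactl --hardware` prints a similar per-node summary.

```python
import glob
import os
import re


def parse_node_memtotal_kb(meminfo_text):
    """Return MemTotal in kB from one node's meminfo text, or None if absent."""
    match = re.search(r"MemTotal:\s+(\d+)\s*kB", meminfo_text)
    return int(match.group(1)) if match else None


def node_memory_kb(sysfs_root="/sys/devices/system/node"):
    """Map node name (e.g. 'node0') to its MemTotal in kB.

    Returns an empty dict if the kernel exposes no NUMA nodes at sysfs_root.
    """
    totals = {}
    pattern = os.path.join(sysfs_root, "node[0-9]*", "meminfo")
    for path in sorted(glob.glob(pattern)):
        node = os.path.basename(os.path.dirname(path))
        with open(path) as handle:
            totals[node] = parse_node_memtotal_kb(handle.read())
    return totals


if __name__ == "__main__":
    for node, kb in node_memory_kb().items():
        print(node, kb, "kB")
```

If a node is missing from the output, or one node reports roughly all of the installed memory, the DIMMs are probably not distributed the way you intended.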
> > The memory will be installed by myself correctly - that is,
> > distributing the memory according to cpu. However, it appears that
> > one of my nodes (my first Opteron machine) may well be one that has
> > only one bank of four DIMM slots assigned to cpu 0 and shared by cpu
> > 1. It uses a Tyan K8W Tiger s2875 motherboard. My other two nodes
> > use Arima HDAMA motherboards with SATA support - each cpu has a bank
> > of 4 DIMMs associated with it. The Tyan node is getting 4 @ 2 GB
> > DIMMs, one of the HDAMA nodes is getting 8 @ 1 GB (both instances
> > fully populating the available DIMM slots) and the last machine is
> > going to get 4 @ 1 GB DIMMs for one cpu and 2 @ 2 GB for the other.
> That last scheme might give you some unbalanced performance, but that is
> something to look up in the MB's instruction manual (ie: you might be
> better off installing the RAM as 1G+1G+2G for both CPUs instead of 4x1G +
On my Opteron systems, wouldn't 3 DIMMs per CPU drop me to 64-bit
memory access rather than the 128-bit access that is available when
each CPU has an even number of DIMMs?
> > It looks like I may want to upgrade my motherboard before exploring
> > NUMA / affinity then.
> If you're getting into "upgrading" (ie: throwing money at) anything, then
> you're getting onto the slippery slope of the hardware selection debate ;)
Slippery indeed. At this point, I think I may just install the RAM to
bring my current calculation out of swap and be done with the cluster
for now. Given that I think one of my nodes uses HyperTransport for
all of cpu 1's memory access, would it hurt anything to use affinity
when only 2 out of 3 nodes can benefit from affinity?
> > This discussion as well as reading about NUMA and affinity elsewhere
> > leads to another question - what is the difference between using
> > numactl or using the affinity options of my parallelization software
> > (in my case openmpi)?
> numactl is an application to help nudge processes in the correct
> direction. Implementing cpu affinity within your code makes your code
> explicitly aware that it will run on an SMP machine (ie: it's hardcoded
> and you don't need to call a script to change your process's affinity).
> In that regard, Chris Samuel replied with the mention of Torque and PBS
> which would support affinity assignment. IMHO, that would be the most
> logical place to control affinity (as long as one can provide some memory
> access hints, ie: same options as seen in numactl's manpage)
> > Thanks,
> > Mark (Kosmowski)
> Eric Thibodeau
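To make the "affinity within your code" option concrete, here is a minimal sketch of a process pinning itself to a set of CPUs. It uses Python's standard `os.sched_setaffinity` / `os.sched_getaffinity` (Linux only); the helper name `pin_to_cpus` is made up for illustration. It covers CPU affinity only, roughly what taskset does. It sets no memory policy, so it is not a substitute for `numactl --localalloc`.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict process `pid` (0 means the calling process) to the given CPU ids.

    Returns the affinity set that is in effect afterwards.
    """
    os.sched_setaffinity(pid, set(cpus))
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    # Pin to the lowest-numbered CPU this process is currently allowed to use.
    allowed = sorted(os.sched_getaffinity(0))
    print("pinned to", pin_to_cpus(allowed[:1]))
```

Which CPU ids belong to which socket is machine-specific, so a real job would choose the set from the node layout rather than simply taking the first allowed CPU.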
Again, thank you for this discussion - I'm learning quite a bit!