[Beowulf] Again about NUMA (numactl and taskset)
diep at xs4all.nl
Mon Jun 23 16:27:54 PDT 2008
On Jun 23, 2008, at 9:12 PM, Mark Hahn wrote:
>> "how sure are we that a process (or thread) that allocated and
>> initialized and writes to memory at a single specific memory node,
>> also keeps getting scheduled at a core on that memory node?"
> numactl --cpubind=0 --membind=0
>> It seems to me that sometimes (like every second or so) threads
>> jump from 1 memory node to another. I could be wrong,
>> but i certainly have that impression with the linux kernels.
> you can always tie a thread to a core. for non-bound threads,
> the question is really how long the kernel should leave a runnable
> thread "on" a busy cpu before running it on another (idle) cpu.
> the kernel does try to avoid this, but how hard has in the past depended on
> the kernel's guess about the cache footprint of the thread and its
> timeslice (how long it typically runs before yielding.)
Mark, thanks for your input. I've tried that numactl command several
times, to no avail. It kept getting it wrong. That was a few years ago,
though, the last time I toyed with numactl for a few days; it may have
improved by now.
It is in itself a very relevant topic that Michael Kuzminsky
addresses, as when a thread allocates a lot of memory, it is really
Now I assume that what I'm doing is, on paper, the ideal situation. If an
AMD machine with 2 to 4 memory controllers
has, say, 4 GB of RAM, I give each process (memory - 500 MB) / 4.
So that is quite a bit of RAM. This RAM gets hammered nonstop,
storing better ways of finding the holy grail.
Each core writes about 150k entries a second through its memory
controller to RAM on my AMD dual Opteron dual-core 2.4 GHz machine
(which most of you probably consider old energy-wasting junk
nowadays, but well). If a CPU has, say, 750 MB of RAM, then the time
to reach a load factor alpha of 0.5 in memory (ignoring the
chaining, which happens a lot by the way, as following the
chain is faster thanks to how latency to RAM works) is roughly
0.5 * 750 MB / (20 bytes * 0.150 M entries/s) = 0.5 * 750 / 3 = 125 seconds.
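The fill-time estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name and the use of decimal megabytes are my own assumptions, and the 20 bytes per entry, 150k entries per second and 750 MB figures are the ones quoted in the text.

```python
def seconds_to_load_factor(table_bytes, entry_bytes, entries_per_second, alpha):
    """Time to fill a fraction `alpha` of a table at a steady write rate."""
    bytes_to_fill = alpha * table_bytes
    bytes_per_second = entry_bytes * entries_per_second
    return bytes_to_fill / bytes_per_second

MB = 1_000_000  # assumption: decimal megabytes, matching the rough arithmetic

# Figures from the post: 750 MB per CPU, 20-byte entries, 150k entries/s.
fill_seconds = seconds_to_load_factor(750 * MB, 20, 150_000, 0.5)
game_seconds = 6 * 3600  # a 6-hour game

print(f"{fill_seconds:.0f} s to reach alpha = 0.5")
print(f"{game_seconds / fill_seconds:.0f} such fill periods in a 6-hour game")
```

The second figure just puts the fill time in proportion to the length of a game.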
A game can last for an hour or 6.
So even a single switch within that 6 hours to another memory node is
I tried using commands to lock each process to a different core (4
processes, 4 cores). I still saw a flip sometimes.
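For anyone wanting to reproduce this, here is a minimal Python sketch of pinning a process to one core and then checking which CPU it last ran on. It assumes Linux: os.sched_setaffinity is Linux-only, and field 39 of /proc/<pid>/stat is the "processor" field (CPU last executed on). The helper names are hypothetical, and this is not the set of commands I actually used back then.

```python
import os

def pin_to_core(core):
    """Restrict the calling process to one core.

    Returns the resulting affinity set, or None on platforms that lack
    sched_setaffinity (it is Linux-specific)."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {core})  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)

def last_cpu_from_stat(stat_line):
    """Extract field 39 (CPU last run on) from a /proc/<pid>/stat line."""
    # Field 2, the command name, is parenthesised and may contain spaces,
    # so split only the text after the last ')'. That text starts at field 3.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[39 - 3])

def last_cpu(pid="self"):
    """Read the CPU a process last ran on (Linux /proc only)."""
    with open(f"/proc/{pid}/stat") as f:
        return last_cpu_from_stat(f.read())
```

Polling last_cpu() for each search process once a second would show whether a flip really happens. Note that this covers only the CPU side; where the memory lives is a separate matter, which is what the --membind option in Mark's numactl line is for.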
Now of course on big clusters/supers some software support from
manufacturers allows automigration of nodes and memory, with good
So I guess for this type of scheduling we are talking at a different level.
Avoiding RAM latency over the network by scheduling
nodes closer to each other is really important.
Yet within one node it is a different story.
Suppose we've got 4 search processes P0..P3 and 4 cores C0..C3.
I am guessing this happens; please tell me it is wrong:
Some OS service gets a timeslice on C1, and search process P1 gets
pushed back into the queue.
C2's timeslice at memory node 1 finishes. P2 gets pushed to the back of
the FIFO queue. P1 is ahead of P2 in the queue,
so P1 runs on C2.
What I want is in fact that P2 starts to run on C2 and P1 stays
in the queue until the service's timeslice has finished.
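To make the guessed scenario concrete, here is a toy Python model of it. It is only an illustration of the guess above, not of how the Linux scheduler actually works: the single shared FIFO queue, the "home core" table and both function names are assumptions made up for this sketch.

```python
from collections import deque

def next_for_core(queue, core, home=None):
    """Pick the next process to run on `core`.

    home=None : plain shared FIFO, take whoever is at the front.
    home given: take the first queued process whose home core is `core`,
                or nothing if no such process is waiting."""
    if home is None:
        return queue.popleft() if queue else None
    for proc in queue:
        if home[proc] == core:
            queue.remove(proc)  # safe: we return right after mutating
            return proc
    return None

def scenario(affinity_aware):
    """Replay the two events from the text and see who ends up on C2."""
    home = {f"P{i}": f"C{i}" for i in range(4)}  # P0 on C0, ..., P3 on C3
    queue = deque()
    queue.append("P1")  # an OS service takes C1, so P1 is queued
    queue.append("P2")  # the timeslice on C2 ends, so P2 is queued behind P1
    picked = next_for_core(queue, "C2", home if affinity_aware else None)
    return picked, list(queue)

print(scenario(affinity_aware=False))  # shared FIFO: P1 migrates to C2
print(scenario(affinity_aware=True))   # what I want: P2 stays on C2, P1 waits
```

In the shared-FIFO case C2 picks P1, which is the flip I suspect; in the second case P1 simply waits until C1 is free again.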
Note that the above is pure guesswork, based on a hundred
assumptions drawn from things I *thought* I saw here or there;
I didn't look in the kernel code for it. Compared to that, Perry is not
paranoid at all.
When getting in a further state, you only consider someone paranoid
when that person is paranoid with respect to future occasions;
seeing some sort of ghost or bottleneck will be there as "it has to
RGB, don't give up yet! Do your public duty and take care that Perry
speaks out on those subjects,
as we all have to deal with it, some more than others! Maybe tempt
him to post on how he deals with potential nuke builders?