[Beowulf] Again about NUMA (numactl and taskset)

Mon Jun 23 09:41:21 PDT 2008

I would add to this:

"How sure are we that a process (or thread) that allocated,
initialized, and writes to memory on a single specific memory node
also keeps getting scheduled on a core of that memory node?"

It seems to me that sometimes (like every second or so) threads jump
from one memory node to another. I could be wrong,
but I certainly have that impression with the Linux kernels.
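
One way to test this impression rather than guess is to sample which
CPU a process last ran on, then pin it and sample again. The following
is a minimal Python sketch, not a definitive tool. It assumes Linux,
reads the "processor" field (field 39) of /proc/self/stat, and pins
with os.sched_setaffinity, the same system call taskset relies on. The
function names and sampling parameters are invented for illustration,
and it only watches the main thread of the process.

```python
import os
import sys
import time


def parse_stat_cpu(stat_line):
    """Return the 'processor' field (field 39) of a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split only the text that follows the last ')'.
    rest = stat_line[stat_line.rfind(")") + 1:].split()
    # rest[0] is field 3 (state), so field 39 sits at index 36.
    return int(rest[36])


def current_cpu():
    """CPU the main thread last ran on (Linux only)."""
    with open("/proc/self/stat") as f:
        return parse_stat_cpu(f.read())


def sample_cpus(n, interval):
    """Take n samples of current_cpu(), busy-waiting `interval` s between them."""
    samples = []
    for _ in range(n):
        samples.append(current_cpu())
        deadline = time.monotonic() + interval
        while time.monotonic() < deadline:
            pass  # stay runnable, like a compute-bound thread
    return samples


def count_migrations(samples):
    """Count how often two consecutive samples name different CPUs."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)


def main():
    # Illustrative values only; a real measurement should run much longer.
    before = sample_cpus(10, 0.02)
    cpu = min(os.sched_getaffinity(0))   # any core we are allowed to use
    os.sched_setaffinity(0, {cpu})       # pin, as "taskset -c <cpu>" would
    after = sample_cpus(10, 0.02)
    print("migrations before pinning:", count_migrations(before))
    print("migrations after pinning: ", count_migrations(after))


if __name__ == "__main__" and sys.platform.startswith("linux"):
    main()
```

A CPU change is only a node change if the two cores belong to
different memory nodes; on Linux the mapping can be read from
/sys/devices/system/node/node*/cpulist.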

That said, it has improved a lot; now all we need is a better
compiler for Linux. For my chess program, GCC generates an
executable that searches 22% fewer positions per second than the
one built with Visual C++ 2005.

Thanks,
Vincent

On Jun 23, 2008, at 4:01 PM, Mikhail Kuzminsky wrote:

> I'm testing my 1st dual-socket quad-core Opteron 2350-based server.
> Let me assume that the RAM used by kernel and system processes is  
> zero, there is no physical RAM fragmentation, and the affinity of  
> processes to CPU cores is maintained. I assume also that both the  
> nodes are populated w/equal number of the same DIMMs.
>
> If I run a thread-parallelized (for example, w/OpenMP) application
> w/8 threads (8 = number of server CPU cores), the ideal case for all
> the ("equal") threads is: the shared memory used by each of the 2
> CPUs (by each of the 2 "quads" of processes) should be divided
> equally between the 2 nodes, and the local memory used by each
> process should be mapped analogously.
> Theoretically, such an ideal case may be realized if my application
> (8 threads) uses practically all the RAM and uses only shared memory
> (I assume here also that all the RAM addresses have the same load,
> and the size of the program code is zero :-) ).
>
> The questions are:
> 1) Is there some way to distribute the local memory of the threads
> analogously (I assume that it has the same size for each thread)
> using "reasonable" NUMA allocation?
>
> 2) Is it right that using numactl for an application may improve
> performance in the following case:
> the number of application processes is equal to the number of cores
> of one CPU *AND* the RAM needed by the application fits on the DIMMs
> of one node (I assume that RAM is allocated "contiguously")?
>
> What will happen to performance (when using numactl) if the RAM
> size required is larger than the RAM available on one node, and
> the program therefore does not take advantage of (load-balanced)
> simultaneous use of the memory controllers on both CPUs?
> (I also assume that RAM is allocated contiguously.)
>
> 3) Is there some reason to use things like
> mpirun -np N /usr/bin/numactl <numactl_parameters>  my_application   ?
>
> 4) If I use malloc() and don't use numactl, how can I tell from
> which node Linux will begin the real memory allocation? (Recall
> that I assume all the RAM is free.) And how can I tell which DIMMs
> correspond to the higher or the lower RAM addresses?
>
> 5) In which cases is it reasonable to switch on "Node memory
> interleaving" (in the BIOS) for an application which uses more
> memory than is present on one node?
> And BTW: if I use taskset -c CPU1,CPU2,... <program_file>
> and the program_file creates some new processes, will all these
> processes run only on the CPUs defined in the taskset command?
>
> Mikhail Kuzminsky
> Computer Assistance to Chemical Research Center,
> Zelinsky Institute of Organic Chemistry
> Moscow
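
On the taskset question at the end of the quoted message: on Linux
the CPU affinity mask is inherited by children created with fork()
and is preserved across execve(), so new processes stay on the same
CPUs unless the program changes the mask itself. The following is a
small Python sketch to check this on a given machine, assuming Linux
and Python 3.7 or later. The names child_affinity, parse_cpus and
CHILD_CODE are made up for illustration.

```python
import os
import subprocess
import sys

# One-liner executed by the child: print the CPUs it is allowed to use.
CHILD_CODE = "import os; print(*sorted(os.sched_getaffinity(0)))"


def parse_cpus(text):
    """Turn whitespace-separated CPU numbers, e.g. '0 2 3', into a set."""
    return {int(tok) for tok in text.split()}


def child_affinity():
    """Start a new interpreter process and return the affinity mask it sees."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True, text=True, check=True,
    )
    return parse_cpus(result.stdout)


def main():
    cpu = min(os.sched_getaffinity(0))   # any core we are allowed to use
    os.sched_setaffinity(0, {cpu})       # restrict ourselves, as taskset would
    print("parent:", sorted(os.sched_getaffinity(0)))
    print("child: ", sorted(child_affinity()))


if __name__ == "__main__" and sys.platform.startswith("linux"):
    main()
```

If both lines print the same CPU list, the child inherited the
restriction. A program that calls sched_setaffinity() itself, or a
launcher that sets its own binding, can still override it.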