[Beowulf] Which Xeon supports NUMA?
Mark Hahn hahn at mcmaster.ca
Tue Mar 18 14:00:10 PDT 2008
- Previous message: [Beowulf] Which Xeon supports NUMA?
- Next message: [Beowulf] Three questions on a new Beowulf Cluster
> given that many core Xeons (especially quad and/or many socket systems) have
> some memory speed issues. With NUMA the kernel seems to be able to optimize
> this somehow.

I don't believe so. Intel currently still uses a single memory controller (MCH), which means that memory access is, in the NUMA sense, uniform. I don't believe that Intel's recent use of multiple socket-MCH links, or multiple independent FBDIMM channels off the MCH, changes this.

for contrast, here's an Intel Itanium 2 (It2) chipset:

http://www.intel.com/products/i/chipsets/e8870sp/e8870_blkdiag_8way_800.jpg

you can see that there are two FSBs with 4 cpus each. a CPU on the left will have non-uniform access to a memory bank which happens to be on the right side of the system. I don't believe any of the Intel x86 chipsets provide this kind of design, though several other companies have done numa x86 chipsets (IBM for one).

the interesting thing is that Intel has decided to embrace the numa-oriented system architecture of AMD (et al). it'll be very interesting to see how this plays out with Nehalem/QPI. obviously, AMD really, really needs to wake up and try a little harder to compete...

> (2) More importantly, has someone measured (how?) if this improves
> performance?

usually, tuning for NUMA just means trying to keep a process near its memory. in the chipset above, if a proc starts on the left half, make an effort to allocate its memory on the left as well, and keep scheduling it on left cpus. the kernel does contain code that tries to understand this topology - the most common machines that use it are multi-socket opteron boxes. but systems like SGI Altix depend on this sort of thing quite heavily.

following is a trivial measurement of the effect. I'm running the stream benchmark on a single thread. in the first case, I force the process and memory to be on the same socket. in the second, I put the process on the "wrong" socket.

    [hahn@rb17 ~]$ numactl --membind=0 --cpubind=0 ./s
    ...
    The total memory requirement is 1144 MB
    You are running each test 11 times
    ...
    Function      Rate (MB/s)   Avg time   Min time   Max time
    Copy:         5298.8324     0.1515     0.1510     0.1520
    Scale:        5334.1523     0.1504     0.1500     0.1510
    Add:          5455.4020     0.2200     0.2200     0.2200
    Triad:        5455.3902     0.2200     0.2200     0.2200
    ...

    [hahn@rb17 ~]$ numactl --membind=0 --cpubind=1 ./s
    ...
    Function      Rate (MB/s)   Avg time   Min time   Max time
    Copy:         3556.1072     0.2253     0.2250     0.2260
    Scale:        3620.4688     0.2213     0.2210     0.2220
    Add:          3647.9716     0.3305     0.3289     0.3310
    Triad:        3659.0890     0.3305     0.3280     0.3310

note that NUMA optimizations are a wonderful thing, but hardly a panacea. for instance, a busy system might not be able to put all of a proc's memory on a particular node. or perhaps the cpus of that node are busy. and then think about multithreaded programs. on top of that, consider caches, which these days are variously per-core, per-chip and per-socket.

> Thanks for a brief answer

oh, sorry ;)
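to put a number on the two stream runs above, here is a minimal Python sketch that computes the local-vs-remote penalty per kernel. the rates are copied from the output in the post; the function name `numa_penalty` and the dict layout are just illustrative choices, not part of stream or numactl.

```python
# STREAM rates (MB/s) copied from the two numactl runs in the post.
# "local"  = --membind=0 --cpubind=0 (process and memory on the same node)
# "remote" = --membind=0 --cpubind=1 (process on the other node)
local = {"Copy": 5298.8324, "Scale": 5334.1523, "Add": 5455.4020, "Triad": 5455.3902}
remote = {"Copy": 3556.1072, "Scale": 3620.4688, "Add": 3647.9716, "Triad": 3659.0890}


def numa_penalty(local_rates, remote_rates):
    """Return {kernel: (local/remote ratio, percent of bandwidth lost)}.

    Hypothetical helper for illustration only.
    """
    result = {}
    for kernel, local_rate in local_rates.items():
        remote_rate = remote_rates[kernel]
        ratio = local_rate / remote_rate
        lost_percent = 100.0 * (1.0 - remote_rate / local_rate)
        result[kernel] = (ratio, lost_percent)
    return result


penalty = numa_penalty(local, remote)
for kernel, (ratio, lost_percent) in penalty.items():
    print(f"{kernel:6s} local/remote = {ratio:.2f}x, "
          f"remote run loses {lost_percent:.1f}% of bandwidth")
```

for these particular numbers, every kernel comes out close to 1.5x, i.e. the "wrong socket" run gives up about a third of the bandwidth. that is one machine and one single-threaded benchmark, so treat it as an indication rather than a general figure.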
