[Beowulf] performance tweaks and optimum memory configs for a Nehalem

Gus Correa gus at ldeo.columbia.edu
Mon Aug 10 12:40:15 PDT 2009

Joshua Baker-LePain wrote:
> On Mon, 10 Aug 2009 at 11:43am, Rahul Nabar wrote
>> On Mon, Aug 10, 2009 at 7:41 AM, Mark Hahn<hahn at mcmaster.ca> wrote:
>>>> (a) I am seeing strange scaling behaviours with Nehalem cores. eg A
>>>> specific DFT (Density Functional Theory) code we use is maxing out
>>>> performance at 2, 4 cpus instead of 8. i.e. runs on 8 cores are
>>>> actually slower than 2 and 4 cores (depending on setup)
>>> this is on the machine which reports 16 cores, right?  I'm guessing
>>> that the kernel is compiled without numa and/or ht, so enumerates 
>>> virtual
>>> cpus first.  that would mean that when otherwise idle, a 2-core
>>> proc will get virtual cores within the same physical core.  and that 
>>> your 8c
>>> test is merely keeping the first socket busy.
>> No. On both machines. The one reporting 16 cores and the other
>> reporting 8. i.e. one hyperthreaded and the other not. Both having 8
>> physical cores.
>> What is bizarre is I tried using -np 16. That ought to definitely
>> utilize all cores, right? I'd have expected the 16 core performance to
>> be the best. But no, the performance peaks at a smaller number of
>> cores.
> Well, as there are only 8 "real" cores, running a computationally 
> intensive process across 16 should *definitely* do worse than across 8. 
> However, it's not so surprising that you're seeing peak performance with 
> 2-4 threads.  Nehalem can actually overclock itself when only some of 
> the cores are busy -- it's called Turbo Mode.  That *could* be what 
> you're seeing.

Hi Rahul, Joshua, list

If Rahul is running these tests with his production jobs,
which he says require 2GB/process, and if he has 24GB/node
(or is it 16GB/node?), then 16 processes on a node demand
16 x 2GB = 32GB, which is more than the physical memory,
and memory paging probably kicked in.
Would this be the reason for the drop in performance, Rahul?
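As a back-of-the-envelope check (a sketch; the 2GB/process and
24GB/node figures come from this thread, the rest is hypothetical):

```python
# Rough check of whether an MPI run fits in a node's physical memory.
# Figures from the thread: 2 GB per process, 24 GB per node (assumed).
mem_per_process_gb = 2
node_mem_gb = 24

def fits_in_memory(nprocs, per_proc_gb=mem_per_process_gb, total_gb=node_mem_gb):
    """Return True if nprocs processes fit in physical memory."""
    return nprocs * per_proc_gb <= total_gb

for nprocs in (2, 4, 8, 16):
    need = nprocs * mem_per_process_gb
    verdict = "OK" if fits_in_memory(nprocs) else "paging likely"
    print(f"-np {nprocs}: needs {need} GB -> {verdict}")
```

With these numbers, -np 8 needs 16GB and fits, while -np 16 needs
32GB and would start paging on a 24GB node.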

In any case, Joshua is right that you can't expect linear scaling from
8 to 16 processes on a node.
What I saw on an IBM machine
with POWER6 processors and SMT (similar to Intel hyperthreading)
was a speedup of around 1.4 from doubling the thread count, rather than 2.
Still a great deal!
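In numbers (a hypothetical illustration using that ~1.4 SMT speedup;
the wall-clock time is made up):

```python
# SMT speedup illustration with a hypothetical baseline time.
t_8_cores = 100.0    # seconds with 8 processes, one per physical core
smt_speedup = 1.4    # typical SMT gain observed (POWER6 figure above)
t_16_threads = t_8_cores / smt_speedup

speedup = t_8_cores / t_16_threads
efficiency = speedup / 2.0   # relative to the ideal 2x from doubling threads

print(f"16-thread time: {t_16_threads:.1f} s, "
      f"speedup {speedup:.2f}x, "
      f"efficiency vs. 2x ideal: {efficiency:.0%}")
```

So doubling the threads via SMT buys about 40% more throughput,
i.e. roughly 70% of what two extra physical cores would deliver.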

If I understand right, hyperthreading opportunistically schedules
a second thread onto a core's idle execution units.
As clever and efficient as it is, I would guess this mechanism
cannot produce as much work as two physical cores.
There is an article about it in Tom's Hardware.
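To check how the kernel enumerated the logical CPUs (Mark's point
about sibling virtual cores landing on the same physical core), one
can read the topology files under sysfs. A sketch, with a canned
sample standing in for a real /sys tree:

```python
# Map logical CPUs to physical cores from sysfs-style topology data.
# On a live Linux box the values come from
# /sys/devices/system/cpu/cpuN/topology/{physical_package_id,core_id};
# this canned sample mimics a hyperthreaded 2-socket, 8-core Nehalem node.
sample = {
    # logical cpu: (physical_package_id, core_id)
    0: (0, 0), 1: (0, 1), 2: (0, 2), 3: (0, 3),
    4: (1, 0), 5: (1, 1), 6: (1, 2), 7: (1, 3),
    8: (0, 0), 9: (0, 1), 10: (0, 2), 11: (0, 3),
    12: (1, 0), 13: (1, 1), 14: (1, 2), 15: (1, 3),
}

def siblings(topology):
    """Group logical CPUs that share a physical core (SMT siblings)."""
    groups = {}
    for cpu, core in topology.items():
        groups.setdefault(core, []).append(cpu)
    return groups

for core, cpus in sorted(siblings(sample).items()):
    print(f"socket {core[0]} core {core[1]}: logical CPUs {cpus}")
```

If two of your MPI ranks end up on logical CPUs that share a core
(e.g. 0 and 8 in this sample), they are competing for the same
execution units, which would explain poor scaling well before all
physical cores are busy.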

My $0.02 of guesses
Gus Correa
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA

> ------------------------------------------------------------------------
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
