[Beowulf] bizarre scaling behavior on a Nehalem

Wed Aug 12 11:50:16 PDT 2009

In message from Gus Correa <gus at ldeo.columbia.edu> (Wed, 12 Aug 2009 
14:09:04 -0400):
>Hi Bill, list
>
>Bill:  This is very interesting indeed.  Thanks for sharing!
>
>Bill's graph seems to show that Shanghai and Barcelona scale
>(almost) linearly with the number of cores, whereas Nehalem stops
>scaling and flattens out at 4 cores.
>The Nehalem 8-core and 4-core curves are virtually indistinguishable,
>and for very large arrays 4 cores is ahead.
>Only for huge arrays (>16M) does Nehalem get ahead
>of Shanghai and Barcelona.

IMHO, if the arrays are not "huge", they will fit in the L3 cache (8 MB!).
Or is the X axis given in Mwords?

Mikhail

>
>Did I interpret the graph right?
>Wasn't this the type of scaling problem that plagued
>the Clovertown and Harpertown?
>Any possibility that kernels, BIOS, etc., are not yet ready for Nehalem?
>
>Thanks,
>Gus Correa
>---------------------------------------------------------------------
>Gustavo Correa
>Lamont-Doherty Earth Observatory - Columbia University
>Palisades, NY, 10964-8000 - USA
>---------------------------------------------------------------------
>
>Bill Broadley wrote:
>> I've been working on a pthread memory benchmark that is loosely
>> modeled on McCalpin's stream.  It's been quite a challenge to remove
>> all the noise/lost performance from the benchmark to get close to the
>> performance I expected.  Some of the obstacles:
>> * For the compilers that tend to be better at stream (open64 and
>>   pathscale), you lose the performance if you just replace
>>   double a[],b[],c[] with double *a,*b,*c. Patch[1] available.
>>   I don't have a workaround for this, suggestions welcome.  Is it
>>   really necessary for dynamic arrays to be substantially slower
>>   than static?
>> * You have to be very careful with pointer alignment, both with
>>   cache lines and with each other
>> * cpu_affinity (by CPU id)
>> * numa (by socket id)
>> 
>> The results are relatively smooth graphs. Here's an example; it's
>> uselessly busy until you toggle off a few graphs (by clicking on the
>> key):
>> 
>> http://cse.ucdavis.edu/bill/pstream.svg
>> 
>> The biggest puzzle I have now is why the previous generation Intel
>> quads, the current generation AMD quads, and numerous other CPUs show
>> a big benefit in L1, while the Nehalem shows no benefit.
>> 
>> [1] http://cse.ucdavis.edu/bill/stream-malloc.patch
>> 
>> 
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin 
>>Computing
>> To change your subscription (digest mode or unsubscribe) visit 
>>http://www.beowulf.org/mailman/listinfo/beowulf
>