[Beowulf] Nehalem memory configs

Joe Landman landman at scalableinformatics.com
Sat Apr 11 11:43:42 PDT 2009


Since the part is released, I can report a stream test :)

richard.walsh at comcast.net wrote:

> 64 GB/sec is the right dual-socket theoretical number for this
> situation, and Intel presents the value of 33 GB/sec for the stream
> triad for the dual socket boards, so 35 GB/sec could be a copy
> perhaps, but nothing was mentioned about any benchmark in the memory
> piece.  In any case, I think we have the right theoretical and
> probable real-world numbers expressed here, if people were wondering.

2-socket Intel motherboard with two dual-core (not quad-core) Nehalem
E5502 1.8 GHz processors, running the OpenMP build of stream (I bumped N
way up to get a reasonable measurement).

landman@velocibunny:~/stream$ ./stream_c_omp.exe
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 200000000, Offset = 0
Total memory required = 4577.6 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 4
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 130623 microseconds.
    (= 130623 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       16545.0680       0.1942       0.1934       0.1958
Scale:      16098.2714       0.1996       0.1988       0.2019
Add:        17929.8514       0.2684       0.2677       0.2697
Triad:      17682.8117       0.2719       0.2715       0.2722
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
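
For anyone who wants to check the table, here is a minimal sketch of how the Rate column follows from the array size and the Min time column. It assumes the standard STREAM accounting (Copy and Scale touch two arrays per iteration, Add and Triad touch three, and MB means 10^6 bytes). The function and variable names are mine, for illustration only; they are not taken from the STREAM source.

```python
# Sketch: recompute STREAM rates from array size and best (minimum) time.
# Assumption: standard STREAM byte counting; names here are illustrative.

ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}
BYTES_PER_WORD = 8          # "8 bytes per DOUBLE PRECISION word"
ARRAY_SIZE = 200_000_000    # "Array size = 200000000"

# Min time column (seconds) from the Nehalem E5502 run above.
NEHALEM_MIN_TIMES = {"Copy": 0.1934, "Scale": 0.1988, "Add": 0.2677, "Triad": 0.2715}


def stream_rate_mb_s(kernel, min_time_s, n=ARRAY_SIZE):
    """Bandwidth in MB/s (1 MB = 1e6 bytes) for one kernel."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * n
    return 1.0e-6 * bytes_moved / min_time_s


def total_memory_mib(n=ARRAY_SIZE):
    """Memory for the three arrays, in units of 2**20 bytes."""
    return 3 * BYTES_PER_WORD * n / 1048576.0


if __name__ == "__main__":
    print(f"Total memory required = {total_memory_mib():.1f} MB")
    for kernel, t in NEHALEM_MIN_TIMES.items():
        print(f"{kernel:6s} {stream_rate_mb_s(kernel, t):12.1f} MB/s")
```

The recomputed rates differ slightly from the printed ones because the Min time column is rounded to four decimal places.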

And for laughs, the same test (same binary) on a Shanghai 2.3 GHz
(Opteron 2376) with OMP_NUM_THREADS=4:


landman@pegasus-a3g:~/stream$ ./stream_c_omp.exe
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 200000000, Offset = 0
Total memory required = 4577.6 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 4
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 210029 microseconds.
    (= 210029 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       10885.6547       0.2943       0.2940       0.2946
Scale:      10966.1188       0.2923       0.2918       0.2929
Add:        12019.7420       0.4002       0.3993       0.4012
Triad:      12127.1875       0.3965       0.3958       0.3968
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
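
To put the two runs side by side, a small sketch that divides the Nehalem rates by the Shanghai rates, kernel by kernel. The numbers are copied from the two Rate columns above; the dictionary and function names are just illustrative. Keep in mind this compares two specific boxes with whatever memory configuration each happened to have, so it is a rough indication only.

```python
# Sketch: per-kernel ratio of the two STREAM runs (4 threads each).
# Rates in MB/s, copied from the output above; names are illustrative.

NEHALEM_E5502 = {
    "Copy": 16545.0680,
    "Scale": 16098.2714,
    "Add": 17929.8514,
    "Triad": 17682.8117,
}

SHANGHAI_2376 = {
    "Copy": 10885.6547,
    "Scale": 10966.1188,
    "Add": 12019.7420,
    "Triad": 12127.1875,
}


def bandwidth_ratios(a, b):
    """Return a[kernel] / b[kernel] for every kernel in a."""
    return {kernel: a[kernel] / b[kernel] for kernel in a}


if __name__ == "__main__":
    for kernel, ratio in bandwidth_ratios(NEHALEM_E5502, SHANGHAI_2376).items():
        print(f"{kernel:6s} Nehalem/Shanghai = {ratio:.2f}x")
```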

I suspect we have the pegasus memory in a non-optimal config; I will
look into it later next week.

Assuming we can get a pair of quad core Nehalem units into our test 
machine, it appears that 32 GB/s on stream is quite possible.  Right now 
it looks like ~4 GB/s per thread.
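
The back-of-the-envelope estimate above can be sketched as follows. This assumes bandwidth scales linearly with thread count when going from 4 to 8 threads, which is an assumption and not something measured here; real scaling depends on the memory configuration and may well fall short. The function names are made up for illustration.

```python
# Sketch: per-thread bandwidth and a naive linear projection to 8 threads.
# Assumption (not measured): bandwidth scales linearly with thread count.

TRIAD_MB_S = 17682.8117   # Triad rate from the 4-thread Nehalem run
THREADS_MEASURED = 4


def per_thread_mb_s(total_mb_s, threads):
    """Average bandwidth per thread in MB/s."""
    return total_mb_s / threads


def projected_mb_s(total_mb_s, threads, target_threads):
    """Naive linear extrapolation to target_threads."""
    return per_thread_mb_s(total_mb_s, threads) * target_threads


if __name__ == "__main__":
    per_thread = per_thread_mb_s(TRIAD_MB_S, THREADS_MEASURED)
    print(f"per thread: {per_thread / 1000:.1f} GB/s")
    print(f"8 threads (linear guess): "
          f"{projected_mb_s(TRIAD_MB_S, THREADS_MEASURED, 8) / 1000:.1f} GB/s")
```

With the rounder figure of 4 GB/s per thread, the same arithmetic gives 8 x 4 = 32 GB/s.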

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


