
[Beowulf] Nehalem memory configs


Joe Landman landman at scalableinformatics.com
Sat Apr 11 11:43:42 PDT 2009


Since the part is released, I can report a stream test :)

richard.walsh at comcast.net wrote:

> 64 GB/sec is the right dual-socket theoretical number for this
> situation, and Intel presents the value of 33 GB/sec for the stream
> triad for the dual socket boards, so 35 GB/sec could be a copy perhaps,
> but nothing was mentioned about any benchmark in the memory piece. In
> any case, I think we have the right theoretical and probable real-world
> numbers expressed here, if people were wondering.
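
For anyone wanting to check where a 64 GB/sec dual-socket figure can come from, here is a back-of-envelope sketch. The channel count, transfer rate, and bus width below are assumptions (3 DDR3 channels per socket at 1333 MT/s, 8 bytes per transfer); none of them are stated in the thread, so treat the result as illustrative only.

```python
# Theoretical peak memory bandwidth, back-of-envelope.
# ASSUMPTIONS (not stated in the thread): 3 DDR3 channels per socket,
# 1333 MT/s per channel, 8 bytes (64 bits) per transfer.
CHANNELS_PER_SOCKET = 3
TRANSFERS_PER_SEC = 1333e6
BYTES_PER_TRANSFER = 8
SOCKETS = 2

per_socket = CHANNELS_PER_SOCKET * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER
dual_socket = SOCKETS * per_socket

print(f"per socket : {per_socket / 1e9:.1f} GB/s")
print(f"dual socket: {dual_socket / 1e9:.1f} GB/s")
```

Slower DIMMs or fewer populated channels would lower the peak proportionally.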

2-socket Intel motherboard with two dual-core (not quad-core) Nehalem
E5502 1.8 GHz processors, running the OpenMP build of stream (I bumped N
way up to get a reasonable measurement).

landman@velocibunny:~/stream$ ./stream_c_omp.exe
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 200000000, Offset = 0
Total memory required = 4577.6 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 4
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 130623 microseconds.
    (= 130623 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       16545.0680       0.1942       0.1934       0.1958
Scale:      16098.2714       0.1996       0.1988       0.2019
Add:        17929.8514       0.2684       0.2677       0.2697
Triad:      17682.8117       0.2719       0.2715       0.2722
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
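
As a sanity check on the output above, the reported rates can be roughly reproduced from the array size and the minimum times. This is a sketch that assumes the usual STREAM accounting (Copy and Scale touch 2 arrays per iteration, Add and Triad touch 3; MB means 10^6 bytes); because the printed times are rounded to four decimals, the recomputed rates only match to within a fraction of a percent.

```python
# Recompute STREAM rates from the printed array size and minimum times.
# ASSUMPTION: standard STREAM byte counting, with MB = 10**6 bytes.
N = 200_000_000          # "Array size" from the output
BYTES_PER_WORD = 8       # double precision
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}
MIN_TIME = {"Copy": 0.1934, "Scale": 0.1988, "Add": 0.2677, "Triad": 0.2715}

# Three arrays of N doubles, reported by STREAM in units of 2**20 bytes.
total_mib = 3 * N * BYTES_PER_WORD / 2**20


def rate_mb_s(kernel):
    """Bytes moved per iteration divided by the best time, in MB/s."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_WORD * N
    return bytes_moved / MIN_TIME[kernel] / 1e6


print(f"Total memory required = {total_mib:.1f} MB")
for kernel in ARRAYS_TOUCHED:
    print(f"{kernel + ':':8s}{rate_mb_s(kernel):12.1f} MB/s")
```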

And for laughs, the same test (same binary) on a Shanghai 2.3 GHz
(Opteron 2376) with OMP_NUM_THREADS=4:


landman@pegasus-a3g:~/stream$ ./stream_c_omp.exe
-------------------------------------------------------------
STREAM version $Revision: 5.8 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 200000000, Offset = 0
Total memory required = 4577.6 MB.
Each test is run 10 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Number of Threads requested = 4
-------------------------------------------------------------
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 210029 microseconds.
    (= 210029 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       10885.6547       0.2943       0.2940       0.2946
Scale:      10966.1188       0.2923       0.2918       0.2929
Add:        12019.7420       0.4002       0.3993       0.4012
Triad:      12127.1875       0.3965       0.3958       0.3968
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
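
To put the two runs side by side, a small sketch that computes the Nehalem-to-Shanghai ratio for each kernel. The rates are copied from the two outputs above; the dictionary names are just for illustration.

```python
# Ratio of reported STREAM rates (MB/s), Nehalem E5502 vs Shanghai 2376,
# both runs using 4 threads and the same binary.
nehalem = {"Copy": 16545.0680, "Scale": 16098.2714,
           "Add": 17929.8514, "Triad": 17682.8117}
shanghai = {"Copy": 10885.6547, "Scale": 10966.1188,
            "Add": 12019.7420, "Triad": 12127.1875}

ratio = {kernel: nehalem[kernel] / shanghai[kernel] for kernel in nehalem}

for kernel, value in ratio.items():
    print(f"{kernel + ':':8s}{value:.2f}x")
```

On these numbers the Nehalem box comes out roughly 1.5x ahead on every kernel, though a single run on each machine is not much to go on.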

I suspect we have the pegasus memory in a non-optimal config; I will look
at it later next week.

Assuming we can get a pair of quad core Nehalem units into our test
machine, it appears that 32 GB/s on stream is quite possible: right now
it looks like ~4 GB/s per thread, and two quad core parts would give us
8 threads.
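
The per-thread estimate and the projection can be sketched as below. The linear scaling from 4 to 8 threads is an assumption and probably an optimistic one, since memory bandwidth per socket saturates; the 8-thread figure is a projection, not a measurement.

```python
# Per-thread bandwidth from the measured Nehalem triad, and a naive
# linear projection to two quad-core parts.
# ASSUMPTION: bandwidth scales linearly with thread count.
TRIAD_MB_S = 17682.8117   # measured with 4 threads
THREADS_MEASURED = 4
THREADS_PROJECTED = 8     # 2 sockets x 4 cores

per_thread_mb_s = TRIAD_MB_S / THREADS_MEASURED
projected_mb_s = per_thread_mb_s * THREADS_PROJECTED

print(f"per thread: {per_thread_mb_s / 1e3:.1f} GB/s")
print(f"projected : {projected_mb_s / 1e3:.1f} GB/s")
```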

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


