[Beowulf] Nehalem memory configs

Tom Elken tom.elken at qlogic.com
Mon Apr 13 12:02:28 PDT 2009


> On Behalf Of Joe Landman
> 
> Since the part is released, I can report a stream test :)

And so can I :-)    (below)

> 
> richard.walsh at comcast.net wrote:
> 
> > 64 GB/sec is the right dual-socket theoretical number for this
> > situation, and Intel presents the value of 33 GB/sec for the stream
> > triad for the dual socket boards, so 35 GB/sec could be a copy perhaps,
> > but nothing was mentioned about any benchmark in the memory piece.
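
For readers who want to check the 64 GB/sec figure, here is a minimal sketch of the arithmetic. It assumes DDR3-1333 on three memory channels per socket with 8 bytes moved per transfer per channel; the function name and parameters are illustrative only, not taken from any tool mentioned in the thread.

```python
# Sketch: theoretical peak memory bandwidth for a dual-socket system.
# Assumptions: DDR3-1333 (1333 mega-transfers/s), three channels per
# socket, 8 bytes (64 bits) per transfer per channel.

def peak_bandwidth_gb_s(sockets, channels_per_socket, mega_transfers_per_s,
                        bytes_per_transfer=8):
    """Return peak bandwidth in GB/s, with 1 GB taken as 10**9 bytes."""
    bytes_per_s = (sockets * channels_per_socket
                   * mega_transfers_per_s * 1e6 * bytes_per_transfer)
    return bytes_per_s / 1e9

dual_socket_peak = peak_bandwidth_gb_s(sockets=2, channels_per_socket=3,
                                       mega_transfers_per_s=1333)
print(f"Dual-socket peak: {dual_socket_peak:.1f} GB/s")
print(f"A 33 GB/s triad is {33 / dual_socket_peak:.0%} of that peak")
```

Under these assumptions the peak comes out just under 64 GB/s, so a 33 GB/sec triad would be roughly half of the theoretical number.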

The STREAM benchmark was mentioned in the delltechcenter piece, but which sub-benchmark (Triad or Copy, etc.) was not.
Here are some results we got on a Nehalem system with dual
Intel Xeon W5580 CPUs @ 3.20GHz,
6x 2GB DDR3-1333 DIMMs (one per memory channel),
and SMT turned off.
All 4 STREAM components are over 37 GB/s when run on 8 threads across the two CPUs:
------------------

OpenMP (8 threads)
Intel 11.0, icc -O3 -openmp -static
Array size = 32000000, Offset = 0
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       38705.2547       0.0134       0.0132       0.0135
Scale:      37735.3959       0.0137       0.0136       0.0138
Add:        37293.9249       0.0207       0.0206       0.0209
Triad:      37388.7235       0.0207       0.0205       0.0209

Serial
Intel 11.0,  icc -O3 -static
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       10781.6770       0.0475       0.0475       0.0475
Scale:      10080.7104       0.0508       0.0508       0.0508
Add:        12646.7882       0.0608       0.0607       0.0608
Triad:      12628.8395       0.0608       0.0608       0.0608

-------------------

The 3.2 GHz W5580 part is for workstations.  We'll remeasure when we get some servers with somewhat slower CPUs, but I would not expect a big difference from the above.
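
To put the two tables above side by side, here is a rough sketch that computes the 8-thread speedup over serial and the fraction of theoretical peak for each kernel. The rates are copied from the tables; the 64 GB/s peak is the dual-socket figure quoted earlier in the thread, taken here as an assumption, and the `summarize` helper is purely illustrative.

```python
# Rates in MB/s copied from the OpenMP (8 threads) and serial tables above.
openmp = {"Copy": 38705.2547, "Scale": 37735.3959,
          "Add": 37293.9249, "Triad": 37388.7235}
serial = {"Copy": 10781.6770, "Scale": 10080.7104,
          "Add": 12646.7882, "Triad": 12628.8395}

# Assumption: 64 GB/s theoretical dual-socket peak, as quoted in the thread.
PEAK_MB_S = 64_000.0

def summarize(openmp_rates, serial_rates, peak_mb_s):
    """Return per-kernel speedup over serial and fraction of peak."""
    return {
        name: {
            "speedup": rate / serial_rates[name],
            "fraction_of_peak": rate / peak_mb_s,
        }
        for name, rate in openmp_rates.items()
    }

summary = summarize(openmp, serial, PEAK_MB_S)
for name, row in summary.items():
    print(f"{name:6s} speedup {row['speedup']:.2f}x, "
          f"{row['fraction_of_peak']:.0%} of peak")
```

With these inputs, Triad on 8 threads is a bit under 3x the serial rate and a little under 60% of the assumed peak.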

-Tom Elken


> > In any case, I think we have the right theoretical
> > and probable real-world numbers expressed here, if people were
> > wondering.
> 
> 2-socket Intel MB with 2 dual core (not quad core) Nehalem E5502 1.8 GHz
> processors, running stream omp (I bumped N way up to get a reasonable
> measurement).
> 
> landman at velocibunny:~/stream$ ./stream_c_omp.exe
> -------------------------------------------------------------
> STREAM version $Revision: 5.8 $
> -------------------------------------------------------------
> This system uses 8 bytes per DOUBLE PRECISION word.
> -------------------------------------------------------------
> Array size = 200000000, Offset = 0
> Total memory required = 4577.6 MB.
> Each test is run 10 times, but only
> the *best* time for each is used.
> -------------------------------------------------------------
> Number of Threads requested = 4
> -------------------------------------------------------------
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 130623 microseconds.
>     (= 130623 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function      Rate (MB/s)   Avg time     Min time     Max time
> Copy:       16545.0680       0.1942       0.1934       0.1958
> Scale:      16098.2714       0.1996       0.1988       0.2019
> Add:        17929.8514       0.2684       0.2677       0.2697
> Triad:      17682.8117       0.2719       0.2715       0.2722
> -------------------------------------------------------------
> Solution Validates
> -------------------------------------------------------------
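
As a cross-check on the output above, here is a small sketch of how the Rate column relates to the array size and the Min time column. It assumes STREAM's usual accounting: Copy and Scale move two 8-byte words per element, Add and Triad move three, and 1 MB is 10^6 bytes. The helper name is made up for illustration, and because the printed times are rounded, the recomputed rates only match the reported ones approximately.

```python
# Words moved per array element for each STREAM kernel (reads plus writes).
WORDS_MOVED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}

def stream_rate_mb_s(kernel, array_size, min_time_s, word_bytes=8):
    """Recompute a STREAM rate in MB/s (10**6 bytes) from the best time."""
    bytes_moved = WORDS_MOVED[kernel] * word_bytes * array_size
    return bytes_moved / min_time_s / 1e6

N = 200_000_000  # "Array size" from the output above
min_times = {"Copy": 0.1934, "Scale": 0.1988, "Add": 0.2677, "Triad": 0.2715}
reported = {"Copy": 16545.0680, "Scale": 16098.2714,
            "Add": 17929.8514, "Triad": 17682.8117}

recomputed = {k: stream_rate_mb_s(k, N, t) for k, t in min_times.items()}
for kernel, rate in recomputed.items():
    print(f"{kernel:6s} recomputed {rate:10.1f} MB/s, "
          f"reported {reported[kernel]:10.1f} MB/s")

# Three arrays of N doubles; STREAM reports this figure in units of 2**20 bytes.
total_mib = 3 * 8 * N / 2**20
print(f"Total memory required = {total_mib:.1f} MB")
```

The same accounting explains why Add and Triad take longer per pass than Copy and Scale at a similar rate: they move half again as many bytes.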
> 
> and for laughs, same test run (with same binary) on Shanghai 2.3 GHz
> (2376) with OMP_NUM_THREADS=4
> 
> 
> landman at pegasus-a3g:~/stream$ ./stream_c_omp.exe
> -------------------------------------------------------------
> STREAM version $Revision: 5.8 $
> -------------------------------------------------------------
> This system uses 8 bytes per DOUBLE PRECISION word.
> -------------------------------------------------------------
> Array size = 200000000, Offset = 0
> Total memory required = 4577.6 MB.
> Each test is run 10 times, but only
> the *best* time for each is used.
> -------------------------------------------------------------
> Number of Threads requested = 4
> -------------------------------------------------------------
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 210029 microseconds.
>     (= 210029 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function      Rate (MB/s)   Avg time     Min time     Max time
> Copy:       10885.6547       0.2943       0.2940       0.2946
> Scale:      10966.1188       0.2923       0.2918       0.2929
> Add:        12019.7420       0.4002       0.3993       0.4012
> Triad:      12127.1875       0.3965       0.3958       0.3968
> -------------------------------------------------------------
> Solution Validates
> -------------------------------------------------------------
> 
> I suspect we have the pegasus memory in a non-optimal config, will look
> later on next week.
> 
> Assuming we can get a pair of quad core Nehalem units into our test
> machine, it appears that 32 GB/s on stream is quite possible.  Right now
> it looks like ~4 GB/s per thread.
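
The per-thread estimate above can be sketched as follows. This takes the 4-thread Triad rate from the E5502 run and extrapolates to 8 threads assuming perfectly linear scaling, which is an optimistic assumption (memory bandwidth normally saturates before core count does), so treat the projection as an upper bound rather than a prediction.

```python
# Triad rate (MB/s) and thread count from the 4-thread E5502 run above.
triad_mb_s = 17682.8117
threads = 4

per_thread_mb_s = triad_mb_s / threads

# Assumption: bandwidth scales linearly with thread count up to 8 threads.
projected_threads = 8
projected_mb_s = per_thread_mb_s * projected_threads

print(f"Per thread: {per_thread_mb_s / 1000:.1f} GB/s")
print(f"Projected at {projected_threads} threads: "
      f"{projected_mb_s / 1000:.1f} GB/s")
```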
> 
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
>         http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615
