[Beowulf] Nehalem memory configs
Tom Elken tom.elken at qlogic.com
Mon Apr 13 12:02:28 PDT 2009
> On Behalf Of Joe Landman
>
> Since the part is released, I can report a stream test :)

And so can I :-) (below)

>
> richard.walsh at comcast.net wrote:
>
> > 64 GB/sec is the right dual-socket theoretical number for this
> > situation, and Intel presents the value of 33 GB/sec for the stream
> > triad for the dual socket boards,
> >
> > so 35 GB/sec could be a copy perhaps, but nothing was mentioned about
> > any benchmark in the memory piece.

The STREAM benchmark was mentioned in the delltechcenter piece, but
which sub-benchmark (Triad or Copy, etc.) was not.

Here are some results we got on a Nehalem system with Dual Intel Xeon
W5580 @ 3.20GHz CPUs, 6x 2GB DDR3-1333 DIMMs (one per memory channel),
and SMT turned off, where all 4 STREAM components are over 37 GB/s when
run on 8 threads over two CPUs:

------------------
OpenMP (8 threads)
Intel 11.0, icc -O3 -openmp -static
Array size = 32000000, Offset = 0
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       38705.2547       0.0134       0.0132       0.0135
Scale:      37735.3959       0.0137       0.0136       0.0138
Add:        37293.9249       0.0207       0.0206       0.0209
Triad:      37388.7235       0.0207       0.0205       0.0209

Serial
Intel 11.0, icc -O3 -static
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       10781.6770       0.0475       0.0475       0.0475
Scale:      10080.7104       0.0508       0.0508       0.0508
Add:        12646.7882       0.0608       0.0607       0.0608
Triad:      12628.8395       0.0608       0.0608       0.0608
-------------------

The 3.2 GHz, W5580 part is for workstations. We'll remeasure when we
get some servers with somewhat slower CPUs, but I would not expect a
big difference from the above.

-Tom Elken

> > In any case, I think we have the right theoretical
> >
> > and probable real-world numbers expressed here, if people were
> > wondering.
>
> 2-socket Intel MB with 2 dual core (not quad core) Nehalem E5502 1.8 GHz
> processors, running stream omp (I bumped N way up to get a reasonable
> measurement).
>
> landman at velocibunny:~/stream$ ./stream_c_omp.exe
> -------------------------------------------------------------
> STREAM version $Revision: 5.8 $
> -------------------------------------------------------------
> This system uses 8 bytes per DOUBLE PRECISION word.
> -------------------------------------------------------------
> Array size = 200000000, Offset = 0
> Total memory required = 4577.6 MB.
> Each test is run 10 times, but only
> the *best* time for each is used.
> -------------------------------------------------------------
> Number of Threads requested = 4
> -------------------------------------------------------------
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 130623 microseconds.
>    (= 130623 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function      Rate (MB/s)   Avg time     Min time     Max time
> Copy:       16545.0680       0.1942       0.1934       0.1958
> Scale:      16098.2714       0.1996       0.1988       0.2019
> Add:        17929.8514       0.2684       0.2677       0.2697
> Triad:      17682.8117       0.2719       0.2715       0.2722
> -------------------------------------------------------------
> Solution Validates
> -------------------------------------------------------------
>
> and for laughs, same test run (with same binary) on Shanghai 2.3 GHz
> (2376) with OMP_NUM_THREADS=4
>
>
> landman at pegasus-a3g:~/stream$ ./stream_c_omp.exe
> -------------------------------------------------------------
> STREAM version $Revision: 5.8 $
> -------------------------------------------------------------
> This system uses 8 bytes per DOUBLE PRECISION word.
> -------------------------------------------------------------
> Array size = 200000000, Offset = 0
> Total memory required = 4577.6 MB.
> Each test is run 10 times, but only
> the *best* time for each is used.
> -------------------------------------------------------------
> Number of Threads requested = 4
> -------------------------------------------------------------
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> Printing one line per active thread....
> -------------------------------------------------------------
> Your clock granularity/precision appears to be 1 microseconds.
> Each test below will take on the order of 210029 microseconds.
>    (= 210029 clock ticks)
> Increase the size of the arrays if this shows that
> you are not getting at least 20 clock ticks per test.
> -------------------------------------------------------------
> WARNING -- The above is only a rough guideline.
> For best results, please be sure you know the
> precision of your system timer.
> -------------------------------------------------------------
> Function      Rate (MB/s)   Avg time     Min time     Max time
> Copy:       10885.6547       0.2943       0.2940       0.2946
> Scale:      10966.1188       0.2923       0.2918       0.2929
> Add:        12019.7420       0.4002       0.3993       0.4012
> Triad:      12127.1875       0.3965       0.3958       0.3968
> -------------------------------------------------------------
> Solution Validates
> -------------------------------------------------------------
>
> I suspect we have the pegasus memory in a non-optimal config, will look
> later on next week.
>
> Assuming we can get a pair of quad core Nehalem units into our test
> machine, it appears that 32 GB/s on stream is quite possible. Right now
> it looks like ~4 GB/s per thread.
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics LLC,
> email: landman at scalableinformatics.com
> web  : http://www.scalableinformatics.com
>        http://jackrabbit.scalableinformatics.com
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
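
For anyone who wants to check the arithmetic behind the figures in this
thread (the 64 GB/sec theoretical number, the fraction of it that the
8-thread Triad run reaches, and the "~4 GB/s per thread" estimate), here
is a rough sketch. It is not part of either poster's measurements. It
assumes 8 bytes moved per DDR3 transfer per channel, which the thread
does not state, and it takes three channels per socket from the "6x 2GB
DIMMs, one per memory channel" description of the dual-socket W5580 box.
The function names are made up for illustration, and the 8-thread
projection simply assumes linear scaling.

```python
# Back-of-the-envelope checks on the bandwidth figures quoted in this thread.
# Assumption (not stated in the thread): a DDR3 channel moves 8 bytes per
# transfer. Three channels per socket follows from "6x DIMMs, one per
# memory channel" spread over two CPUs.

def peak_bandwidth_gbs(sockets, channels_per_socket, mega_transfers_per_s,
                       bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return (sockets * channels_per_socket * mega_transfers_per_s
            * bytes_per_transfer / 1000.0)

# Bytes STREAM counts per array element for each kernel (8-byte doubles):
# Copy and Scale touch two arrays, Add and Triad touch three.
STREAM_BYTES_PER_ELEMENT = {"Copy": 16, "Scale": 16, "Add": 24, "Triad": 24}

def stream_rate_mbs(kernel, array_size, min_time_s):
    """Recompute a STREAM rate in MB/s (1 MB = 1e6 bytes) from the best time."""
    return STREAM_BYTES_PER_ELEMENT[kernel] * array_size / min_time_s / 1.0e6

# Dual-socket W5580 system with DDR3-1333, one DIMM per channel.
peak = peak_bandwidth_gbs(sockets=2, channels_per_socket=3,
                          mega_transfers_per_s=1333)

# Fraction of peak reached by the 8-thread Triad result (37388.7235 MB/s).
triad_efficiency = 37388.7235 / (peak * 1000.0)

# Recompute the E5502 Triad rate from the printed array size and min time.
# The printed min time is rounded, so this only approximately matches the
# reported 17682.8117 MB/s.
e5502_triad = stream_rate_mbs("Triad", 200000000, 0.2715)

# Per-thread Triad rate on the 4-thread E5502 run, and a naive linear
# projection to 8 threads (an assumption, not a measurement).
per_thread_mbs = 17682.8117 / 4
projected_8_threads_mbs = per_thread_mbs * 8

if __name__ == "__main__":
    print(f"theoretical peak (2 sockets): {peak:.1f} GB/s")
    print(f"W5580 Triad / peak:           {triad_efficiency:.1%}")
    print(f"E5502 Triad recomputed:       {e5502_triad:.1f} MB/s")
    print(f"E5502 Triad per thread:       {per_thread_mbs:.1f} MB/s")
    print(f"naive 8-thread projection:    {projected_8_threads_mbs:.1f} MB/s")
```

The recomputed rates will differ slightly from the printed ones because
STREAM reports min times rounded to four decimal places.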
