[Beowulf] Nehalem and Shanghai code performance for our rzf example

Joe Landman landman at scalableinformatics.com
Fri Jan 16 06:25:49 PST 2009


Hi folks:

   Thought you might like to see this.  I rewrote the interior loop for
our Riemann Zeta Function (rzf) example for SSE2, and ran it on a
Nehalem and on a Shanghai.  This code is compute-intensive.  The inner
loop, which had been written like this (with some small hand
optimization, loop unrolling, etc.):

     l[0]=(double)(inf-1 - 0);
     l[1]=(double)(inf-1 - 1);
     l[2]=(double)(inf-1 - 2);
     l[3]=(double)(inf-1 - 3);
     p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
     for(k=start_index;k>end_index;k-=unroll)
        {
           d_pow[0] = l[0];
           d_pow[1] = l[1];
           d_pow[2] = l[2];
           d_pow[3] = l[3];

           for (m=n;m>1;m--)
            {
              d_pow[0] *=  l[0];
              d_pow[1] *=  l[1];
              d_pow[2] *=  l[2];
              d_pow[3] *=  l[3];
            }
           p_sum[0] += one/d_pow[0];
           p_sum[1] += one/d_pow[1];
           p_sum[2] += one/d_pow[2];
           p_sum[3] += one/d_pow[3];

           l[0]-=four;
           l[1]-=four;
           l[2]-=four;
           l[3]-=four;
       }
     sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3] ;

has been rewritten as

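     __m128d __P_SUM = _mm_set_pd1(0.0);        // __P_SUM[0 ... VLEN-1] = 0
     __m128d __ONE = _mm_set_pd1(1.);   // __ONE[0 ... VLEN-1] = 1
     __m128d __DEC = _mm_set_pd1((double)VLEN); // per-iteration decrement
     __m128d __L   = _mm_load_pd(l);    // l must be 16-byte aligned
     __m128d __D_POW;                   // running power of __L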

     for(k=start_index;k>end_index;k-=unroll)
        {
           __D_POW       = __L;

           for (m=n;m>1;m--)
            {
              __D_POW    = _mm_mul_pd(__D_POW, __L);
            }

           __P_SUM       = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));

           __L           = _mm_sub_pd(__L, __DEC);

       }

     _mm_store_pd(p_sum,__P_SUM);

     for(k=0;k<VLEN;k++)
      {
        sum += p_sum[k];
      }

The two codes, plus the baseline code, were run on a Nehalem 3.2 GHz
(desktop) processor, and a Shanghai 2.3 GHz desktop processor.  Here are
the results:

	Code        CPU        Freq (GHz)   Wall clock (s)
	---------   --------   ----------   --------------

	base        Nehalem    3.2          20.5
	optimized   Nehalem    3.2           6.72
	SSE-ized    Nehalem    3.2           3.37

	base        Shanghai   2.3          30.3
	optimized   Shanghai   2.3           7.36
	SSE-ized    Shanghai   2.3           3.68
	
These are single-thread, single-core runs.  The code scales very well
(it is one of our example codes for the HPC/programming/parallelization
classes we do).

I found it interesting that the baseline code performance roughly
tracks the ratio of clock speeds ... The Nehalem has a 39% faster clock,
and showed 48% faster performance (20.5 s vs 30.3 s), which is about 9
percentage points more than could be accounted for by clock speed alone.
With the SSE code, the Nehalem is only about 9% faster (3.37 s vs
3.68 s).

I am sure lots of interesting points can be made out of this (being only 
one test, and not the most typical test/use case either, such points may 
be of dubious value).

I am working on a CUDA version of the above as well, and will try to
compare this to the threaded versions of the above.  I am curious what
we can achieve.

Joe

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
        http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


