[Beowulf] Nehalem and Shanghai code performance for our rzf example

Fri Jan 16 11:53:15 PST 2009

Hi Joe.
I guess it would be straightforward to get an OpenMP version running.
Can you please share your results on 1, 2, 4, and 8 threads?
Use HT off on Nehalem.
Use thread affinity through environment variables or explicitly in the code.
Power management enabled or disabled, but disclosed.
Use SSE3 (Shanghai) and SSE4 (Nehalem).
Disclose also memory frequencies and number of dimms.
Disclose also compiler, version, and flags.
Disclose also size of the arrays.

All that would allow people to reproduce and be able to comment.
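
As a starting point for such a run, here is a minimal sketch of an OpenMP version of the rzf sum. It is not Joe's code: the function name rzf_openmp and its signature are made up for illustration, and it sums 1/k^n for k = inf down to 1 rather than reproducing the unrolled loop exactly.

```c
#include <assert.h>

/* Truncated Riemann zeta sum: 1/k^n for k = inf down to 1.
   Hypothetical helper, not taken from the original rzf source. */
double rzf_openmp(long inf, int n)
{
    assert(inf >= 1 && n >= 2);
    double sum = 0.0;

    /* Each thread accumulates a private partial sum; the reduction
       clause combines them when the loop ends. Without OpenMP enabled
       the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (long k = inf; k >= 1; k--) {
        double l = (double)k;
        double d_pow = l;
        for (int m = n; m > 1; m--)   /* d_pow = l^n by repeated multiply */
            d_pow *= l;
        sum += 1.0 / d_pow;
    }
    return sum;
}
```

With GCC this would be built with -fopenmp, and the thread count set through OMP_NUM_THREADS (1, 2, 4, 8). Affinity can be set from the environment, for example GOMP_CPU_AFFINITY with GCC's libgomp or KMP_AFFINITY with the Intel runtime; which one applies depends on the compiler actually used.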

Thanks,
Joshua

------ Original Message ------
Received: 12:12 PM CST, 01/16/2009
From: Vincent Diepeveen <diep at xs4all.nl>
To: landman at scalableinformatics.com
Cc: Beowulf Mailing List <beowulf at beowulf.org>
Subject: Re: [Beowulf] Nehalem and Shanghai code performance for our rzf
example

> Note that single-threaded performance doesn't say a thing,
> because when just 1 core runs, Nehalem automatically overclocks that core.
> 
> A very nasty feature.
> 
> My experience is that Shanghai scales nearly 4.0 versus 3.2 for Nehalem,
> because of the overclocking of 1 core.
> 
> So seeing a 9% higher IPC is not very weird.
> 
> Thanks,
> Vincent
> 
> On Jan 16, 2009, at 3:25 PM, Joe Landman wrote:
> 
> > Hi folks:
> >
> >   Thought you might like to see this.  I rewrote the interior loop  
> > for our Riemann Zeta Function (rzf) example for SSE2, and ran it on  
> > a Nehalem and on a Shanghai.  This code is compute intensive.  The  
> > inner loop which had been written as this (some small hand  
> > optimization, loop unrolling, etc):
> >
> >     l[0]=(double)(inf-1 - 0);
> >     l[1]=(double)(inf-1 - 1);
> >     l[2]=(double)(inf-1 - 2);
> >     l[3]=(double)(inf-1 - 3);
> >     p_sum[0] = p_sum[1] = p_sum[2] = p_sum[3] = zero;
> >     for(k=start_index;k>end_index;k-=unroll)
> >        {
> >           d_pow[0] = l[0];
> >           d_pow[1] = l[1];
> >           d_pow[2] = l[2];
> >           d_pow[3] = l[3];
> >
> >           for (m=n;m>1;m--)
> >            {
> >              d_pow[0] *=  l[0];
> >              d_pow[1] *=  l[1];
> >              d_pow[2] *=  l[2];
> >              d_pow[3] *=  l[3];
> >            }
> >           p_sum[0] += one/d_pow[0];
> >           p_sum[1] += one/d_pow[1];
> >           p_sum[2] += one/d_pow[2];
> >           p_sum[3] += one/d_pow[3];
> >
> >           l[0]-=four;
> >           l[1]-=four;
> >           l[2]-=four;
> >           l[3]-=four;
> >       }
> >     sum = p_sum[0] + p_sum[1] + p_sum[2] + p_sum[3] ;
> >
> > has been rewritten as
> >
> >     __m128d __P_SUM = _mm_set_pd1(0.0); // __P_SUM[0 ... VLEN] = 0
> >     __m128d __ONE = _mm_set_pd1(1.);   // __ONE[0 ... VLEN] = 1
> >     __m128d __DEC = _mm_set_pd1((double)VLEN);
> >     __m128d __L   = _mm_load_pd(l);
> >
> >     for(k=start_index;k>end_index;k-=unroll)
> >        {
> >           __D_POW       = __L;
> >
> >           for (m=n;m>1;m--)
> >            {
> >              __D_POW    = _mm_mul_pd(__D_POW, __L);
> >            }
> >
> >           __P_SUM       = _mm_add_pd(__P_SUM, _mm_div_pd(__ONE, __D_POW));
> >
> >           __L           = _mm_sub_pd(__L, __DEC);
> >
> >       }
> >
> >     _mm_store_pd(p_sum,__P_SUM);
> >
> >     for(k=0;k<VLEN;k++)
> >      {
> >        sum += p_sum[k];
> >      }
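
The SSE2 fragment above leaves out its declarations. A self-contained sketch of the same idea follows, assuming VLEN is 2 (two doubles per 128-bit register). The function name rzf_sse2, the local variable names, and the scalar tail loop are illustrative additions and not part of the original post. It uses unaligned load and store so the small arrays need no 16-byte alignment, where the original uses _mm_load_pd and _mm_store_pd.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

#define VLEN 2           /* doubles per 128-bit SSE2 register (assumed) */

/* Sum 1/l^n for l = inf-1 down to 1, two terms per vector iteration.
   Hypothetical wrapper around the loop shown in the post. */
double rzf_sse2(long inf, int n)
{
    assert(inf >= 2 && n >= 2);
    long terms = inf - 1;
    double l[VLEN], lanes[VLEN];

    for (int i = 0; i < VLEN; i++)
        l[i] = (double)(inf - 1 - i);

    __m128d p_sum = _mm_setzero_pd();
    const __m128d one = _mm_set1_pd(1.0);
    const __m128d dec = _mm_set1_pd((double)VLEN);
    __m128d lv = _mm_loadu_pd(l);

    for (long k = 0; k < terms / VLEN; k++) {
        __m128d d_pow = lv;
        for (int m = n; m > 1; m--)          /* d_pow = lv^n, lane-wise */
            d_pow = _mm_mul_pd(d_pow, lv);
        p_sum = _mm_add_pd(p_sum, _mm_div_pd(one, d_pow));
        lv = _mm_sub_pd(lv, dec);            /* step both lanes down by VLEN */
    }

    _mm_storeu_pd(lanes, p_sum);
    double sum = lanes[0] + lanes[1];

    /* Scalar tail: when the term count is odd, l = 1 is left over. */
    for (long r = terms % VLEN; r >= 1; r--) {
        double x = (double)r, d = x;
        for (int m = n; m > 1; m--)
            d *= x;
        sum += 1.0 / d;
    }
    return sum;
}
```

For n = 2 and a large inf the result should approach pi^2/6, which gives a quick sanity check against the scalar loop.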
> >
> > The two codes were run on a Nehalem 3.2 GHz (desktop) processor,  
> > and a Shanghai 2.3 GHz desktop processor.  Here are the results
> >
> > Code        CPU        Freq (GHz)   Wall clock (s)
> > ---------   --------   ----------   --------------
> > base        Nehalem    3.2          20.5
> > optimized   Nehalem    3.2          6.72
> > SSE-ized    Nehalem    3.2          3.37
> >
> > base        Shanghai   2.3          30.3
> > optimized   Shanghai   2.3          7.36
> > SSE-ized    Shanghai   2.3          3.68
> > 	
> > These are single thread, single core runs.  Code scales very well
> > (it is one of our example codes for the HPC/programming/parallelization
> > classes we do).
> >
> > I found it interesting that they started out with the baseline code  
> > performance tracking the ratio of clock speeds ... The Nehalem has  
> > a 39% faster clock, and showed 48% faster performance, which is  
> > about 9% more than could be accounted for by clock speed alone.   
> > The SSE code performance appears to be about 9% different.
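
The percentages in this paragraph can be rechecked from the table with a few divisions. The sketch below is only arithmetic on the posted numbers; the helper names speedup and per_clock_ratio are made up for illustration. Note that 48% minus 39% is a difference in percentage points; expressed as a ratio of ratios, the baseline per-clock advantage comes out closer to 6%.

```c
#include <assert.h>

/* How many times faster the run taking fast_s seconds is than slow_s. */
double speedup(double slow_s, double fast_s)
{
    assert(fast_s > 0.0);
    return slow_s / fast_s;
}

/* Speedup left over after dividing out the clock-frequency ratio. */
double per_clock_ratio(double observed_speedup, double clock_ratio)
{
    assert(clock_ratio > 0.0);
    return observed_speedup / clock_ratio;
}

/* Values from the table in the post:
     clock ratio        3.2 / 2.3    is about 1.39
     base speedup       30.3 / 20.5  is about 1.48
     optimized speedup  7.36 / 6.72  is about 1.10
     SSE speedup        3.68 / 3.37  is about 1.09
   per_clock_ratio(base speedup, clock ratio) is about 1.06, while for
   the optimized and SSE codes it is about 0.79, i.e. the 2.3 GHz part
   gets more work done per clock on those two versions. */
```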
> >
> > I am sure lots of interesting points can be made out of this (being  
> > only one test, and not the most typical test/use case either, such  
> > points may be of dubious value).
> >
> > I am working on a CUDA version of the above as well, and will try
> > to compare this to the threaded versions of the above.  I am
> > curious what we can achieve.
> >
> > Joe
> >
> > -- 
> > Joseph Landman, Ph.D
> > Founder and CEO
> > Scalable Informatics LLC,
> > email: landman at scalableinformatics.com
> > web  : http://www.scalableinformatics.com
> >        http://jackrabbit.scalableinformatics.com
> > phone: +1 734 786 8423 x121
> > fax  : +1 866 888 3112
> > cell : +1 734 612 4615
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit  
> > http://www.beowulf.org/mailman/listinfo/beowulf
> 