[Beowulf] HPL Benchmarking and Optimization

Wed Apr 2 21:07:30 PDT 2008

For AMD-based systems, get ACML and gcc, PGI, or PathScale.
For Intel-based systems, get MKL and the Intel compiler.
Run a problem size N that fills around 90% of memory, i.e., a 1.8GB per-core memory footprint.
Run NB=192 on AMD. I don't know the best blocking factor for MKL; I've tried
the same 192 and it does fairly well.
Set CPU affinity for the MPI processes, even on single-socket runs.
Run PxQ as 2x2, 2x4, 4x4, ... depending on the number of cores.
With the above you should get at least 77% efficiency on both AMD and Intel.
As suggested by Tom, the Goto library will give you good performance as well.
You can also try the multithreaded version: use PxQ=1x1 and
OMP_NUM_THREADS=4 for a single-socket quad-core.
Reduce TLB misses with huge pages.
If you get below 75% efficiency, you are doing something wrong.
If you do more than 85% on quadcore, please let me know :)
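
To make the sizing and efficiency numbers above concrete, here is a rough
Python sketch. It is only an illustration under stated assumptions: the
function names are made up for this example, the 2 GiB per core figure is
inferred from the 1.8GB footprint, and the flops-per-cycle value (2
double-precision flops per cycle for an Athlon 64) is an assumption you
should check for your own CPU. Efficiency here means measured Gflops
divided by theoretical peak (cores x GHz x flops per cycle).

```python
import math


def hpl_problem_size(total_mem_bytes, nb=192, mem_fraction=0.90):
    """Largest N (a multiple of NB) whose N x N matrix of doubles
    fits in mem_fraction of the given memory."""
    max_elements = int(mem_fraction * total_mem_bytes / 8)  # 8 bytes per double
    n = math.isqrt(max_elements)
    return (n // nb) * nb


def process_grid(num_procs):
    """Most nearly square P x Q grid with P <= Q and P * Q == num_procs."""
    p = math.isqrt(num_procs)
    while num_procs % p != 0:
        p -= 1
    return p, num_procs // p


def efficiency(rmax_gflops, cores, ghz, flops_per_cycle):
    """Measured performance as a fraction of theoretical peak."""
    rpeak_gflops = cores * ghz * flops_per_cycle
    return rmax_gflops / rpeak_gflops


if __name__ == "__main__":
    # Assumed example: one quad-core node with 2 GiB per core.
    cores = 4
    total_mem = cores * 2 * 2**30
    print("N   =", hpl_problem_size(total_mem, nb=192))
    print("PxQ = %dx%d" % process_grid(cores))
    # Athlon 64 3500+ result from the quoted message: 3.6 Gflops at 2.2 GHz,
    # assuming 2 double-precision flops per cycle.
    print("eff = %.0f%%" % (100 * efficiency(3.6, 1, 2.2, 2)))
```

Rounding N down to a multiple of NB is just a convenience so the last
block is full; treat the result as a starting point and tune from there.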

Regards,
Joshua

------ Original Message ------
Received: Wed, 02 Apr 2008 12:33:25 PM PDT
From: Ellis Wilson <xclski at yahoo.com>
To: beowulf at beowulf.org
Subject: Re: [Beowulf] HPL Benchmarking and Optimization

> Ellis Wilson wrote:
> > Currently I get these kinds of numbers from tested computers using the
> > same environment (gentoo, fortran in gcc, hpl, all same compilation
> > options):
> > 1 x Core2Duo (2.1ghz/core, 2gigs ram) - 2.3Gflops
> > 1 x Athlon 64 3500+ (2.2ghz, 1gig ram) - 1.0Gflops
> > 4 x Core2Duo (2.1ghz/core for a total of 8 cores, 2gigs ram/node,
> > 100mbit Ethernet interconnect) - 6.7Gflops
> 
> Sorry to double post, all; however, I realized my issue involved running
> HPL on the reference library of BLAS that is generic for every
> architecture, and I didn't want to waste anyone's time. Giving Portage
> the benefit of the doubt, I had failed to check that its dependencies
> were best for HPL. Following an install of ATLAS and relinking to its
> libraries, I've gotten the following numbers:
> 1 x Athlon64 3500+ (2.2ghz, 1gig ram) - 3.6GFlops
> 1 x Phenom9600 Quadcore (2.3ghz/core, 2gigs ram) - 11.9GFlops
> 
> I'll likely try MKL soon for the Intel processors I'm interested in.
> 
> The phenom9600 had previously only gotten 4.5 GFlops, and when I tested
> it the second time I simply used the same environment I had compiled for
> the athlon64. Certainly compiling ATLAS native on the phenom will
> increase the result, hopefully about 350% like with the athlon64 (though
> I suspect things will be interesting due to bandwidth, etc. for
> quadcores).
> 
> Anyway, not to end the thread, I am still wondering:
>
> Do those of you who have professional installations, or even simply
> large setups where you are unsure of the exact code that will be run on
> your cluster, use compilation options such as -O3, -funroll-loops,
> -fomit-frame-pointer, etc.?
> 
> Thanks,
> 
> Ellis