[Beowulf] Re: Matrix Multiply

Mon Feb 13 17:19:57 PST 2006

----- Original Message ----- 
From: "Dan Kidger" <daniel.kidger at quadrics.com>
To: <beowulf at beowulf.org>
Cc: "Tom Elken" <elken at pathscale.com>
Sent: Saturday, February 11, 2006 3:13 PM
Subject: Re: [Beowulf] Re: Matrix Multiply

> Tom Elken wrote:
>
>>> mathematician but I'm trying to understand how the benchmark operates.
>>> I would like to test my system by seeing how many FLOPS are achieved
>>> using only the Matrix Multiply.
>>
>> You could probably download the HPC Challenge benchmark (from 
>> http://icl.cs.utk.edu/hpcc/software/index.html )  and cut-paste some code 
>> from its DGEMM (Double-precision, GEneral Matrix-Multiply) sub-benchmark 
>> as a relatively easy way to get a test program for matrix-multiply.
>>
>> DGEMM is typically ~90% or more of the HPL benchmark's profile.
>
> Indeed, I have been doing that for the last couple of years, since hpcc 
> appeared. It is trivial to slightly modify one file of the hpcc source 
> such that you can run just one or two of the seven contained benchmarks 
> by setting a shell variable - for example just to run dgemm in your case, 
> or say ptrans or gups in my case. It is also very easy to pipe the hpcc 
> output through a couple of lines of perl or sed so as to get just the 
> summary output lines for the subset of tests that you ran.
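
For a quick standalone look at how a matrix-multiply FLOPS figure is 
derived, without touching the hpcc source, here is a minimal sketch. It is 
not code from HPCC: the function names matmul and measure_flops are made 
up for illustration, and it assumes the conventional count of 2*n^3 
floating point operations for an n x n multiply. A naive pure-Python loop 
will land far below what a tuned DGEMM reaches, so treat the number only 
as a demonstration of the arithmetic.

```python
import time


def matmul(a, b):
    """Naive triple-loop product of two square matrices (lists of lists)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row_a, row_c = a[i], c[i]
        for k in range(n):
            aik, row_b = row_a[k], b[k]
            for j in range(n):
                row_c[j] += aik * row_b[j]
    return c


def measure_flops(n):
    """Time one n x n multiply; return (flops_per_second, result matrix)."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    start = time.perf_counter()
    c = matmul(a, b)
    # Guard against a zero reading on very small n or coarse timers.
    elapsed = max(time.perf_counter() - start, 1e-9)
    # Assumed operation count: n^3 multiplies plus n^3 adds.
    return 2.0 * n ** 3 / elapsed, c


if __name__ == "__main__":
    rate, _ = measure_flops(100)
    print(f"{rate / 1e6:.1f} MFLOPS")
```

Running it over a range of n, not just one size, is one way to see the 
size-dependent variation in the rate.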
>
> As for the flops, this can depend on the lower bits of the matrix size - 
> it is common to see dgemm implementations oscillate due to cache line hits 
> and the like. Rather than speak in terms of actual flops, it usually makes 
> more sense to quote the percentage of theoretical peak you get. The 
> theoretical peak is well defined - simply the cpu's clock speed multiplied 
> up by various factors like:
>  - times number of cpus per motherboard
>  - times number of cores per cpu
>  - times number of floating point instructions issued per cycle (2 for 
> itanium and alpha, 1 for xeon/opteron)
>  - times width of any SIMD unit (2 for the 2*64-bit wide SSE2)
>  - times two if your FPUs can do chained muladd (like itanium)
>  - times 75% reduction factor if you can't issue floating point loads fast 
> enough (was this G5?)
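
The factors in the list above multiply out directly. Below is a rough 
sketch of that arithmetic; the function and parameter names are invented 
for illustration, and the two machines in the example are hypothetical 
configurations, not measurements of real systems.

```python
def theoretical_peak_flops(clock_hz, cpus_per_board=1, cores_per_cpu=1,
                           fp_issue_per_cycle=1, simd_width=1,
                           chained_muladd=False, load_limit_factor=1.0):
    """Multiply the clock rate by the factors from the list above."""
    peak = clock_hz * cpus_per_board * cores_per_cpu
    peak *= fp_issue_per_cycle * simd_width
    if chained_muladd:
        peak *= 2  # a multiply-add counts as two floating point operations
    # e.g. 0.75 if floating point loads cannot be issued fast enough
    return peak * load_limit_factor


def percent_of_peak(measured_flops, peak_flops):
    """Express a measured rate as a percentage of theoretical peak."""
    return 100.0 * measured_flops / peak_flops


if __name__ == "__main__":
    # Hypothetical dual-socket, single-core 2.0 GHz box with 2-wide SIMD.
    sse2_box = theoretical_peak_flops(2.0e9, cpus_per_board=2, simd_width=2)
    # Hypothetical single 1.5 GHz cpu: 2 FP issues per cycle, chained muladd.
    muladd_box = theoretical_peak_flops(1.5e9, fp_issue_per_cycle=2,
                                        chained_muladd=True)
    print(f"{sse2_box / 1e9:.1f} GFLOPS peak")    # 8.0 GFLOPS peak
    print(f"{muladd_box / 1e9:.1f} GFLOPS peak")  # 6.0 GFLOPS peak
    # An assumed measured dgemm rate of 7.0 GFLOPS on the first box:
    print(f"{percent_of_peak(7.0e9, sse2_box):.1f}% of peak")  # 87.5% of peak
```

Quoting the last figure, not the raw rate, is what makes results 
comparable across machines with different clocks and core counts.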

>
>
> Some cpu architectures come out better than others but you should expect 
> to get say >85% even on the worst (thank you Mr Goto :-) )

- the clock rate of the chip is a factor too
- so are L1 issues and the odd habits of chip caches in general

The L1 cache is what matters most.

Itanium2 can completely hide the latency in a matrix multiply because of its 
1-cycle L1.
Opteron can nearly hide the latency because of its 3-cycle L1 with 2 ports.
Xeon can't hide the latency at all here, as Prescott has a 4-cycle L1 with 
1 port.

Because of this and other issues, Xeon is effectively very slow here when 
you want to achieve high precision by just multiplying(-adding) registers, 
and it is the worst choice for matrix multiply.

Now let's discuss single precision operations. Suddenly the PCs look a lot 
better than Itanium hardware, as SSE2 can be of great help there, issuing 
2 multiplies within 1 cycle.

Actually, all the floating point software we use professionally, which 
(luckily) is partly baked into hardware, is single precision.

Single precision is far more interesting than double precision.
SSE2 is great there.

Vincent

>
>
>
> Daniel.
>
> --------------------------------------------------------------
> Dr. Dan Kidger, Quadrics Ltd.      daniel.kidger at quadrics.com
> One Bridewell St., Bristol, BS1 2AA, UK         0117 915 5505
> ----------------------- www.quadrics.com --------------------
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit 
> http://www.beowulf.org/mailman/listinfo/beowulf
>