[Beowulf] gpgpu

Mon Sep 1 10:34:53 PDT 2008

I made a rough estimate of the possible performance improvement for 
"dgemm on FireStream 9250".
This is an extremely favorable example for GPGPU.

The source data for 9250: peak DP performance 200 GFLOPS, GDDR3 RAM 1 
Gbyte.

1 Gbyte can hold 3 DP (64-bit) n x n matrices for n=6000; they 
require 864 Mbytes.
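
As a quick check of the footprint arithmetic above, here is a minimal 
Python sketch. The function names are made up for illustration, and 
1 Gbyte is assumed to mean 10^9 bytes.

```python
import math

BYTES_PER_DOUBLE = 8  # one DP (64-bit) matrix element

def dgemm_footprint_bytes(n, matrices=3):
    """Bytes needed to hold `matrices` dense n x n DP matrices (A, B, C)."""
    return matrices * n * n * BYTES_PER_DOUBLE

def max_square_dim(mem_bytes, matrices=3):
    """Largest n such that `matrices` n x n DP matrices fit in mem_bytes."""
    return math.isqrt(mem_bytes // (matrices * BYTES_PER_DOUBLE))

print(dgemm_footprint_bytes(6000) / 1e6)  # 864.0 (Mbytes for n = 6000)
print(max_square_dim(10**9))              # 6454 (upper bound on n for 1 Gbyte)
```

Under these assumptions n=6000 sits a little below the ceiling, which 
leaves some room for whatever else has to live in the card's memory.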
Let me suppose that the real performance of the FireStream will be 90% 
of the peak value (I'm afraid the reality will be worse), i.e. 180 GFLOPS.

dgemm requires 2*n^3 FP operations (I neglect the n^2 operations for 
matrix addition and scaling), i.e. 432 GFLOP.
The calculation time will be 432/180 = 2.4 sec.

For the dgemm calculation we'll also need 4 matrix transfers: 3 to the 
GPGPU and 1 from the GPGPU back to the main memory of the server.
That's 1152 Mbytes of data.

For PCI-e x16 v.2 the peak throughput is 8 GB/s, so the transfer time 
will be about 0.144 sec (I don't know what the real PCIe throughput 
may be).

The total calc. time is therefore about 2.54 sec.
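
The GPU-side estimate above can be reproduced with a small Python 
sketch. The function name and parameters are hypothetical; the default 
values (180 GFLOPS sustained, 8 GB/s over PCIe, 4 matrix transfers) are 
the assumptions of this estimate, not measurements.

```python
def gpu_dgemm_time(n, gflops=180.0, pcie_gbytes_per_s=8.0, transfers=4):
    """Return (compute_s, transfer_s) for one n x n DP dgemm on the GPU."""
    flop = 2.0 * n**3                       # n^2 terms are neglected
    compute_s = flop / (gflops * 1e9)
    moved_bytes = transfers * n * n * 8     # 8 bytes per DP element
    transfer_s = moved_bytes / (pcie_gbytes_per_s * 1e9)
    return compute_s, transfer_s

compute_s, transfer_s = gpu_dgemm_time(6000)
print(f"compute {compute_s:.3f} s, transfer {transfer_s:.3f} s, "
      f"total {compute_s + transfer_s:.3f} s")
```

Since the transfer cost grows as n^2 and the compute cost as n^3, the 
PCIe share of the total gets larger as n gets smaller; calling the 
function with a lower pcie_gbytes_per_s shows how sensitive the total 
is to the real bus throughput.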

On a dual-socket quad-core Xeon server w/3 GHz E5472 (8 cores) the peak 
performance is 96 GFLOPS. Parallelized dgemm will give, I believe, 
about 80% of peak, i.e. 77 GFLOPS; therefore the calculation time is 
432/77 = 5.6 sec.

The speedup is 2.2 times. The price increase I don't know; for example, 
from $4500 to $6500 (if the FireStream costs $2000, though it may be 
$1000, as Igor Kozin wrote here), which is about 1.4 times.
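
The CPU-vs-GPU comparison can be sketched the same way in Python. All 
names below are hypothetical. The 4 DP FLOP per cycle per core is an 
assumption inferred from the 96 GFLOPS peak (8 cores x 3 GHz x 4), and 
the GPU time of 2.4 + 0.144 sec and both prices are the guesses used in 
this estimate.

```python
N = 6000
FLOP = 2.0 * N**3                      # dgemm work, n^2 terms neglected

def cpu_dgemm_time(cores=8, ghz=3.0, dp_flops_per_cycle=4, efficiency=0.80):
    """Return (peak_gflops, seconds) for the dual-socket Xeon estimate."""
    peak_gflops = cores * ghz * dp_flops_per_cycle
    seconds = FLOP / (peak_gflops * efficiency * 1e9)
    return peak_gflops, seconds

def compare(gpu_total_s=2.4 + 0.144, price_cpu=4500.0, price_gpu_card=2000.0):
    """Return (speedup, price_ratio) of server+GPU over the plain server."""
    _, cpu_s = cpu_dgemm_time()
    speedup = cpu_s / gpu_total_s
    price_ratio = (price_cpu + price_gpu_card) / price_cpu
    return speedup, price_ratio

peak, cpu_s = cpu_dgemm_time()
speedup, price_ratio = compare()
print(f"CPU peak {peak:.0f} GFLOPS, dgemm {cpu_s:.1f} s")
print(f"speedup {speedup:.2f}x for a price ratio of {price_ratio:.2f}x")
```

Calling compare(price_gpu_card=1000.0) gives the price ratio for the 
cheaper card price mentioned above; the speedup is unchanged.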

But I think there will not be too many jobs that require matrix 
multiplication for *dense* matrices of such large (6000 x 6000) sizes; 
for sparse matrices the dimensions, I believe, will be lower.

Mikhail