kus at free.net
Mon Sep 1 10:34:53 PDT 2008
I made a rough estimate of the possible performance improvement
for dgemm on the FireStream 9250.
dgemm is an extremely favorable example for GPGPU.
The source data for the 9250: peak DP performance is 200 GFLOPS, and
its 1 Gbyte of GDDR3 RAM can hold 3 DP (64-bit) n x n matrices for
n=6000; they require 864 Mbytes.
Let me suppose that the real performance of the FireStream will be 90%
of the peak value (I'm afraid the reality will be worse), i.e. 180 GFLOPS.
dgemm requires 2*n^3 FP operations (I neglect the n^2 operations for
matrix addition and scaling), i.e. 432 GFLOP.
The calculation time will be 432/180 = 2.4 sec.
The dgemm calculation also needs 4 matrix transfers: 3 to the
GPGPU and 1 from the GPGPU back to the main memory of the server.
That's 1152 Mbytes of data (4 x 288 Mbytes).
For PCIe x16 v2 the peak throughput is 8 GB/s, so the
transfer time will be about 0.144 sec (I don't know what the
real PCIe throughput may be).
The total calculation time is therefore about 2.54 sec.
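
The GPU-side arithmetic above can be reproduced with a short Python sketch. All inputs are the figures quoted in this post (peak rate, assumed 90% efficiency, peak PCIe throughput); the function and parameter names are hypothetical and chosen only for illustration, and decimal units (1 Mbyte = 10^6 bytes) are assumed, as in the numbers above.

```python
# Back-of-envelope model of the GPU estimate. Every default below is a figure
# quoted in the post, not a measurement; names are illustrative only.

BYTES_PER_DP = 8  # one 64-bit double-precision value


def gpu_dgemm_estimate(n=6000, peak_gflops=200.0, efficiency=0.90,
                       pcie_gb_per_s=8.0):
    """Return the estimated memory footprint and run time for an n x n dgemm."""
    matrix_bytes = n * n * BYTES_PER_DP

    # Three matrices (A, B, C) must fit in GPU memory at once.
    resident_mb = 3 * matrix_bytes / 1e6

    # 2*n^3 floating-point operations; the n^2 terms are neglected.
    gflop = 2 * n**3 / 1e9
    compute_s = gflop / (peak_gflops * efficiency)

    # Four transfers over PCIe: three matrices in, one result out.
    transfer_mb = 4 * matrix_bytes / 1e6
    transfer_s = (transfer_mb / 1e3) / pcie_gb_per_s

    return {
        "resident_mb": resident_mb,
        "gflop": gflop,
        "compute_s": compute_s,
        "transfer_mb": transfer_mb,
        "transfer_s": transfer_s,
        "total_s": compute_s + transfer_s,
    }


if __name__ == "__main__":
    for key, value in gpu_dgemm_estimate().items():
        print(f"{key:12s} {value:10.3f}")
```

Lowering `efficiency` or `pcie_gb_per_s` shows how quickly the total grows if the real numbers turn out worse than the peak values assumed here.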
On a dual-socket quad-core Xeon server with 3 GHz E5472 (8 cores) the
peak performance is 96 GFLOPS. Parallelized dgemm will give, I believe,
about 80% of peak, i.e. 77 GFLOPS; therefore the calculation time is
432/77 = 5.6 sec.
The speedup is 2.2 times. The price increase I don't know; for example,
from $4500 to $6500 (if the FireStream costs $2000, though it may be
$1000, as Igor Kozin wrote here) it's about 1.4 times.
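
As a rough check of the comparison, the sketch below recomputes the CPU time, the speedup and the price ratio, and lets the assumed GPU efficiency be varied, since the 90% figure is only a guess. The inputs are the numbers quoted in this post; the function names are hypothetical, and the $4500 and $6500 prices are the example values above, not real quotes.

```python
# CPU vs. GPU comparison using the figures quoted in the post.
# All names are illustrative; nothing here is measured.

N = 6000
GFLOP = 2 * N**3 / 1e9                    # 432 GFLOP for one dgemm
TRANSFER_S = 4 * N * N * 8 / 1e9 / 8.0    # 1.152 Gbytes at 8 GB/s peak PCIe


def gpu_time_s(efficiency=0.90, peak_gflops=200.0):
    """GPU compute time plus the four PCIe matrix transfers."""
    return GFLOP / (peak_gflops * efficiency) + TRANSFER_S


def cpu_time_s(efficiency=0.80, peak_gflops=96.0):
    """Dual-socket quad-core 3 GHz Xeon, 96 GFLOPS peak as quoted."""
    return GFLOP / (peak_gflops * efficiency)


def speedup(gpu_efficiency=0.90):
    """How many times faster the GPU run is than the CPU run."""
    return cpu_time_s() / gpu_time_s(gpu_efficiency)


# Example prices from the post: server alone vs. server plus a $2000 card.
PRICE_RATIO = 6500 / 4500


if __name__ == "__main__":
    print(f"CPU time: {cpu_time_s():.2f} s, price ratio: {PRICE_RATIO:.2f}")
    for eff in (0.9, 0.7, 0.5):
        print(f"GPU efficiency {eff:.0%}: speedup {speedup(eff):.2f}x")
```

Comparing `speedup(eff)` with `PRICE_RATIO` gives a quick feel for how far the real GPU efficiency can fall before the extra cost stops paying off for this one kernel.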
But I think there will not be too many jobs that require
multiplication of *dense* matrices of such large (6000 x 6000) sizes;
for sparse matrices the dimensions, I believe, will be lower.