[Fwd: Benchmarking Beowulf]
R. Munk Larsen
rmunk@quake.Stanford.EDU
Fri, 25 Jun 1999 18:51:43 -0400
Greg,
Your letter touches on a fundamental problems in high-performance
computing, so excuse me for giving a long answer.
>
> Yes. It's handy to call that number "MachoFLOPS" in order to make this
> clear. "BogoMIPS" are similar.
>
Sure, I think we agree on the general theme. A similar expression is
"embarrasingly parallel".
>
> I disagree. Few real programs use a lot of GEMM, and the rates you get
> for GEMM are often 3x or 4x the performance that you get on real
> problems.
> <snip...snip>
Well, if your real program uses simple handcoded routines or obsolete
software such as LINPACK, EISPACK or (God forbid!) Numerical Recipes
based on BLAS level 1 and 2 (see below) for dense linear algebra then
you are asking for trouble, and the performance degradation compared
to GEMM of x3 to x4 that you mention is typical. (Yeah, I know. Not
all codes are dominated by linear algebra - see below).
>
> Uhuh. In my years of scientific computing, I've never worked on a code
> for which this was the case. I suppose there are fields for which
> large-size GEMM performance is important, but I'm hard pressed to say
> which...
>
If you have spent "years in scientific computing" I am surprised to
learn that you have never come across an application that involved,
* systems of linear equations
* linear least squares problems
* eigenvalues and eigenvectors
* singular value decomposition
of dense matrices. The state-of-the-art algorithms found in LAPACK and
ScaLAPACK for solving these problems are all build on block-algorithms
where the dominating computational steps can be formulated as BLAS-3
operations (such as a rank-k update C <- alpha*A*B + beta*C, which is
implemented by GEMM). You can substitute "BLAS-3 operation" for
"GEMM" in my previous posting if that makes you feel better. I
recommend [2] and the LAPACK manual [1] and references therein for
further description.
Even for sparse linear systems a lot of work has gone into
developing so-called frontal schemes that work on dense blocks, thus
enabling the use of (almost) BLAS-3 primitives, see e.g. [3].
I gladly admit that coming from numerical linear algebra I am
somewhat biased, and that many codes, including iterative solvers for
linear algebra, do involve a substantial fraction of BLAS-1, BLAS-2
and Fourier transforms, say, which run slower than BLAS-3 (GEMM) on a
machine with a memory hierarchy. I am not saying that all algorithms
can be formulated in terms of BLAS 3 and thus give a performance
comparable to GEMM, but there are many important applications
(mentioned above) where this is possible. Sometimes it requires
rethinking the fundamental structure of the implementation, but in
some cases it suffices to replace a LINPACK, EISPACK or Numerical
Recipes routine with the equivalent LAPACK routine. If you do not have
the ressources to do this, or is forced to run some dusty deck C or
FORTRAN code, I agree that the only thing that will help you is to go
for the newest, fastest Alpha CPU and the best compiler you can get
your hands on.
Best regards,
Rasmus
[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel,
J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney,
O. Ostrouchow and D. Sorensen, {\em LAPACK Users' Guide}, 2nd ed.,
SIAM, Philadelphia, 1995, {\tt http://www.netlib.org/lapack/}.
[2] J.J. Dongarra, I.S. Duff, D.C. Sorenson and H. van der Vorst,
{\em Numerical Linear Algebra for High-Performance Computers},
SIAM, Philadelphia 1998.
[3] J. Demmel, S.C. Eisenstat, J.R. Gilbert, X.S. Li and J.W.H. Liu,
{\em A supernodal approach to sparse partial pivoting}, Technical
Report UCB//CSD-94-883, Computer Scince Divicion, U.C. Berkeley,
Berkeley, California, July 1995.