IBM ASCI White

Thu Jul 6 11:44:31 PDT 2000

On Thu, 6 Jul 2000, Greg Lindahl wrote:

> > The $/GFlop is pretty good too, $8.9K/Gflop.  Has anyone
> > beat this?
> 
> Of course -- it's hard to build a cluster that costs that much for list
> price. The FSL system was cheaper than that.

I'd second this, whatever a "GFLOP" is in a cluster environment
(apparently, by agreement, the simple aggregate sum of the single CPU
GFLOPs, which didn't mean much in the first place;-).  900 MHz Athlons
can deliver just about exactly a billion floating point operations per
second for simple loops that live in cache (see below), for a node cost
of order $1-1.5K.  Dual high-clock PIIIs can likely deliver just about
a billion floating point operations per second for perhaps $2.5K (I'm
not being very careful in getting absolutely current pricing, so don't
sue me if I'm off by a few hundred dollars).

Oh -- my definition of a GFLOP is COUNT*SIZE = 250 million iterations
of the inner loop below completed in one second:

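/* Initialize: every element starts at the fixed point 1.0. */
for(i=0;i<SIZE;i++){
  x[i] = 1.0;
}

/* Timed kernel: 4 flops per element (add, subtract, multiply, divide). */
for(k=1;k<=COUNT;k++){
  for(i=0;i<SIZE;i++){
    x[i] = (1.0 + x[i])*(2.0 - x[i])/2.0;
  }
}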

Each pass through the inner loop is one addition, one multiplication,
one subtraction and one division (plus loop and address arithmetic that
I try to compensate for in the timing).  It is stable (the final
x[i] = 1.0) within system roundoff, so you can do it a lot of times.
SIZE needs to be small enough for x[] to fit into L1 cache.  One can
then vary SIZE to figure out the effect of cache, and so forth.
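
A minimal sketch of the loop above wrapped into C functions follows.
The names run_kernel, kernel_flops and time_kernel, and the choice
SIZE = 1024 (8 KB of doubles), are assumptions for illustration, not
the actual benchmark source, and the empty-loop correction is left out.

```c
#include <stddef.h>
#include <time.h>

#define SIZE 1024            /* assumed: 1024 doubles = 8 KB, meant to fit in L1 */
#define FLOPS_PER_ELEMENT 4  /* add, subtract, multiply, divide */

/* Initialize x[0..SIZE-1] to 1.0, then apply the kernel count times.
   Returns the largest deviation of any x[i] from 1.0 (expected 0.0). */
double run_kernel(double *x, long count)
{
    double worst = 0.0;
    for (size_t i = 0; i < SIZE; i++)
        x[i] = 1.0;
    for (long k = 1; k <= count; k++)
        for (size_t i = 0; i < SIZE; i++)
            x[i] = (1.0 + x[i]) * (2.0 - x[i]) / 2.0;
    for (size_t i = 0; i < SIZE; i++) {
        double d = x[i] > 1.0 ? x[i] - 1.0 : 1.0 - x[i];
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Number of floating point operations done by run_kernel(x, count). */
double kernel_flops(long count)
{
    return (double)FLOPS_PER_ELEMENT * (double)SIZE * (double)count;
}

/* CPU seconds used by run_kernel(x, count), measured with clock(). */
double time_kernel(double *x, long count)
{
    clock_t start = clock();
    volatile double sink = run_kernel(x, count); /* keep the result live */
    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Dividing kernel_flops(count) by time_kernel(x, count) gives flops per
second.  An optimizing compiler may rearrange or shortcut the loop, so
it is worth checking that the timing scales linearly with count before
trusting the number.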

Sure, it's not a LINPACK GFLOP.  Nor does it tell one much about e.g.
transcendentals, the effect of L1 and L2 cache speeds and latencies and
main memory bandwidths and latencies, the effect of context switches and
much much more.  However, it is very close to what most people think of
when they speak of a "floating point operation" and it is a real-world
measurement made with actual compiled code, not a theoretical peak.
FWIW, a 400 MHz PII comes in at about 250 MFLOPS and a 667 MHz Alpha
comes in at about 800 MFLOPS (using the Digital compiler).

Perhaps we're not QUITE at 1 MFLOP/dollar yet.  However, we're within
spitting distance of it -- even if one (cynically enough;-) degrades
these measurements by a factor of two or four, we're still within a
factor of two to four of it.  So the IBM price per GFLOP is high by
(surprise) approximately a factor of 2-4, depending of course on what
its GFLOP rating is with this simple measure.
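
The arithmetic behind that estimate can be sketched as follows.  The
function names and the idea of expressing the pessimism as a multiplier
called derate are illustrative assumptions; the dollar figures are the
rough ones quoted above.

```c
/* Dollars per GFLOP for a node of the given price and measured GFLOPS,
   with the measurement divided by a pessimism factor derate (>= 1). */
double dollars_per_gflop(double node_price, double gflops, double derate)
{
    return node_price * derate / gflops;
}

/* How many times more expensive per GFLOP system a is than system b. */
double cost_ratio(double a_dollars_per_gflop, double b_dollars_per_gflop)
{
    return a_dollars_per_gflop / b_dollars_per_gflop;
}
```

For example, a $1.5K node at 1 GFLOP derated by 2 comes to $3K/GFLOP,
so $8.9K/GFLOP is about 3 times higher; a $1K node derated by 4 comes
to $4K/GFLOP, and the ratio is about 2.2.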

[Those who don't like my GFLOPS are welcome to dislike them, BTW -- I'm
only moderately fond of them myself.  Standard Disclaimer:  The only
meaningful benchmark is YOUR APPLICATION.  The rest of them are just tea
leaves.]

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu

(Note -- I subtract out the empty loop time).

#============================================================
# cbench benchmark run on host webtest
# CPU = Athlon at 900 MHz, Total RAM = 64
# Time (empty) = empty loop
# 5.55user 0.01system 0:05.58elapsed
# Time (full) = (10 Billion flops)
# 14.77user 0.00system 0:14.77elapsed

speed = 10B/(14.77 - 5.55) = 1085 MFLOPS.  Astounding.  Must pipeline
the floats at least.  Too bad this speed isn't reflected in my Monte
Carlo routines...

#============================================================
# cbench benchmark run on host b1
# CPU = PII at 400 MHz, Total RAM = 512
# Time (empty) = (empty loop)
# 12.67user 0.00system 0:12.67elapsed
# Time (full) = (10 Billion flops)
# 53.46user 0.00system 0:53.45elapsed

speed = 10B/(53.46 - 12.67) = 245 MFLOPS.  Not bad.

#============================================================
# cbench benchmark run on host qcd1
# CPU = alpha_21264 at 667 MHz, Total RAM = 512
# Compiler is ccc, forced to actually execute the loop.
# Time (empty) =
# 0.00user 0.00system 0:00.00elapsed
# Time (full) = (doing 10 Billion FLOPS)
# 12.93user 0.00system 0:12.94elapsed

So, speed = 10B/(12.93 - 0.00) = 773 MFLOPS.  Not bad at all.  Of course
it COSTS a whole lot more per FLOP than an Athlon or P6...
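
The three speeds above all come from the same arithmetic: total flops
divided by the full-run user time minus the empty-loop user time.  A
small sketch, with the function name mflops being an assumption for
illustration:

```c
/* MFLOPS from a total flop count and the user times (in seconds) of the
   full and empty runs; the empty-loop time is subtracted as overhead. */
double mflops(double flops, double t_full, double t_empty)
{
    return flops / (t_full - t_empty) / 1.0e6;
}
```

mflops(10.0e9, 14.77, 5.55) gives roughly 1085, mflops(10.0e9, 53.46,
12.67) roughly 245, and mflops(10.0e9, 12.93, 0.0) roughly 773,
matching the Athlon, PII and Alpha figures above.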