[Beowulf] Teraflop chip hints at the future

Fri Feb 16 15:10:22 PST 2007

Richard Walsh wrote:

> Here you are arguing for an ASIC for each typical HPC kernel ... ala
> the GRAPE processor.  I will buy that ... but a commodity multi-core
> CPU is not HPC-special-purpose or low power compared to an FPGA.

FPGA power draw is good: several watts in most cases.  Things are good
when you don't have to power extra cruft.  The latest quad-core parts
from AMD/Intel are in the 20 W/core region (30 for the current Intel,
20 for the new gen).  It would not surprise me to see this get to
10 W/core and below.
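
To put the per-core numbers on a per-socket footing, here is a minimal
sketch of the arithmetic.  The per-core wattages are the ones quoted
above; the function and dictionary names are made up for illustration,
and nothing here is a measured figure.

```python
# Per-core power figures quoted in the message (watts per core).
# The labels are illustrative; "possible future" is the 10 W/core guess.
WATTS_PER_CORE = {
    "current Intel": 30,
    "new gen": 20,
    "possible future": 10,
}

CORES_PER_SOCKET = 4  # "quad core"


def socket_watts(watts_per_core, cores=CORES_PER_SOCKET):
    """Power per socket, counting only the per-core figure."""
    return watts_per_core * cores


for label, watts in WATTS_PER_CORE.items():
    print(f"{label}: {socket_watts(watts)} W/socket")
```

This only multiplies the quoted figures out; it says nothing about
what the rest of the node draws.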

>> purpose CPUs (PowerPCs), DSPs (ADSP21020), and FPGAs for some signal
>> processing applications.  At that time, the DSP could do the FFTs,
>> etc, for the least joules and least time.  Since then, however, the
>> FPGAs have pulled ahead, at least for spaceflight applications.   But
>> that's not because of architectural superiority in a given process..
>> it's that the FPGAs are benefiting from improvements in process
>> (higher density) and nobody is designing space qualified DSPs using
>> those processes (so they are stuck with the old processes).
>    Better process is good, but I think I hear you arguing for
> HPC-specific ASICs again like the GRAPE ... if they
>    can be made cheaply, then you are right ... take the bit stream from
> the FPGA CFD code I have written and tuned, and
>    produce 1000 ASICs for my special purpose CFD-only cluster.  I can

This sounds like D. E. Shaw's work (though I think they are doing it in
FPGAs).

> run it at higher clock rates, but I may need a
>    new chip every time I change my code.

You need a new bitfile every time you change FPGAs or FPGA boards.  This
means that FPGA bitfiles are largely non-portable.  Of course the
process to change the bitfile is a rebuild and ...

>> Heck, the latest SPARC V8 core from ESA (LEON 3) is often implemented
>> in an FPGA, although there are a couple of space qualified ASIC
>> implementations (from Atmel and Aeroflex).
>>
>> In a high volume consumer application, where cost is everything, the
>> ASIC is always going to win over the FPGA.  For more specialized
>> scientific computing, the trade is a bit more even ... But even so,
>> the beowulf concept of combining large numbers of commodity computers
>> leverages the consumer volume for the specialized application, giving
>> up some theoretical performance in exchange for dollars.
>     Right, otherwise we would all be using our own version of  GRAPE,

Some things can be specialized and made fast: GPUs, for example.

> but we are all looking for "New, New Thing"
>     ... a new price-performance regime to take us up to the next level. 
> Is it going to be FPGAs, GPGPUs, commodity
>     multi-core, PIM, or novel 80-processor Intel chips.  I think we are
> in for a period of extended HPC market
>     fragmentation, but in any case I think two features of FPGA

I am not convinced it is going to be fragmented for long.  Take
everything more expensive than $5000US and call it DOA unless it can
easily drop right in and hit 10-100x node performance.  Node pricing is
dropping rapidly.  A 5+ TF cluster quoted several months ago using
previous generation technology came in at a few million dollars.  One
quoted recently came in well under $1M.
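
As a rough sketch of what that drop means in dollars per TF: the two
prices below are hypothetical stand-ins for "a few million" and "well
under $1M" (the actual quotes are not given here), and 5 TF is used for
both clusters.

```python
# HYPOTHETICAL stand-ins for the two quotes; the real prices are not
# stated, only "a few million $" and "well under $1M".
OLD_QUOTE_USD = 3_000_000
NEW_QUOTE_USD = 800_000
CLUSTER_TF = 5.0  # "5+ TF", taken as exactly 5 for the arithmetic


def dollars_per_tf(price_usd, teraflops):
    """Price divided by aggregate TF."""
    return price_usd / teraflops


old = dollars_per_tf(OLD_QUOTE_USD, CLUSTER_TF)
new = dollars_per_tf(NEW_QUOTE_USD, CLUSTER_TF)
print(f"old: ${old:,.0f}/TF  new: ${new:,.0f}/TF  ratio: {old / new:.2f}x")
```

Swap in real quotes to get real numbers; the point is only that the
$/TF bar an accelerator has to beat keeps moving.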

> processing, the programmable core and data flow
>     programming model have intrinsic/theoretical appeal.   These forces
> may be completely overwhelmed by other
>     forces in the market place of course ...

Unless GPUs just won't work, they may be a safe bet as one of the
emerging winners.  Cell should be in there as well.

We demoed a little FPGA board (disclosure: we work with the company
that builds it, and we do sell it) that attaches to a USB2 port and
ran HMMer faster than an 8-core cluster.  The cost and power difference
is huge there, but hopefully we will be able to run p7Viterbi fast on
GPUs.  Then economies of scale may be able to drive some of this into
motherboards, though most MB makers are reluctant to add anything that
increases the cost of their product, even if it is better and makes
their product stand out.  Graphics cards are in *everything*, so you
should pretty much expect them to be one of the winners, if they can get
the codes to run on them.  Cell-BEs are going into millions of PS3s,
and while it might be a stretch, it is possible that some places may
deploy clusters of these (PSC deploying a PS3 cluster?  :) ).

What is pretty clear right now is that anyone with an excessively high
price per unit or per SDK is pretty much knocking themselves out of the
market.  Anyone who cannot build these things in large volumes is
pretty much in trouble in this space.

The other thing that is pretty clear is that as the multi-cores go even
more multi, chips that hyperspecialize in one area may become
marginalized.  There is some data I am not sure if I can talk about, so
I'll talk about the other data that I can.  The Intel quad core units
can do something like 35 GF/socket (rough calc, I am sure some Intel
person can correct me, so please do).

This is good, though it puts pressure on the hyperspecialized chips.
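
For anyone who wants to redo the rough calc, a minimal sketch follows.
The 35 GF/socket, quad-core and 30 W/core figures are the ones from
this message; the 4 double-precision flops per cycle per core is an
assumption, and the function names are made up for illustration.

```python
# Figures taken from the message: ~35 GF/socket, quad core, and
# 30 W/core for the current Intel parts.
QUOTED_GF_PER_SOCKET = 35.0
CORES_PER_SOCKET = 4
WATTS_PER_CORE = 30.0

# ASSUMPTION (not from the message): 4 double-precision flops per
# cycle per core.
FLOPS_PER_CYCLE = 4


def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x flops per cycle."""
    return cores * ghz * flops_per_cycle


def implied_ghz(gflops, cores, flops_per_cycle):
    """Clock rate at which the peak formula yields the given GF figure."""
    return gflops / (cores * flops_per_cycle)


def gflops_per_watt(gflops, cores, watts_per_core):
    """GF per watt, counting only the per-core power figure."""
    return gflops / (cores * watts_per_core)


ghz = implied_ghz(QUOTED_GF_PER_SOCKET, CORES_PER_SOCKET, FLOPS_PER_CYCLE)
gfw = gflops_per_watt(QUOTED_GF_PER_SOCKET, CORES_PER_SOCKET, WATTS_PER_CORE)
print(f"implied clock: {ghz} GHz, efficiency: {gfw:.2f} GF/W")
```

Under that flops-per-cycle assumption, 35 GF/socket corresponds to a
clock a bit above 2 GHz, so the figure is in the range of a theoretical
peak rather than a measured rate.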

Joe

> 
>     Regards,
> 
>     rbw
> 
> 

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615