[Beowulf] [landman at scalableinformatics.com: Re: [Bioclusters] FPGA in bioinformatics clusters (again?)]

Jim Lux James.P.Lux at jpl.nasa.gov
Sun Jan 15 06:39:18 PST 2006


At 04:58 PM 1/14/2006, Joe Landman wrote:

>Jim Lux wrote:
>>At 08:52 AM 1/14/2006, Eugen Leitl wrote:
>
>
>>using masses of FPGAs is fundamentally different from using masses of 
>>computers in a number of ways:
>
>There are a range of options for application acceleration, and no one 
>solution fits all models.  The rest of my posts in the other list covered this.
>
>>1) The FPGA is programmed at a lower conceptual level if you really want 
>>to get the benefit of performance.  Sure, you can implement multiple 
>>PowerPC cores on a Xilinx, and even run off-the-shelf PPC software on 
>>them; however, it would be cheaper and faster to just buy PowerPCs.
>
>I think that buying Xilinx parts just to run PPC software is a serious abuse of 
>the power of the FPGA.  Basically an FPGA is a circuit.  A digital one, 
>but still a circuit.  If your application maps well into this, then you 
>can realize excellent benefits (with appropriate caveats, specifically, YMMV)
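
Agreed.  To make the "maps well into a circuit" point concrete, here's a 
toy C model of the kind of structure that wins on an FPGA: a 4-tap 
transposed FIR filter.  In software the multiply-adds run one after 
another; in fabric every tap is its own multiplier and adder, and a new 
sample enters each clock.  (The taps and sample values below are invented 
purely for illustration.)

  #include <stdio.h>

  #define NTAPS 4

  /* One "clock tick" of a transposed FIR pipeline.  In fabric, all of
     these multiply-adds happen simultaneously, once per clock; here we
     just model the register updates sequentially. */
  static int fir_tick(int x, int regs[NTAPS], const int taps[NTAPS])
  {
      int y = regs[0] + taps[0] * x;           /* output stage       */
      for (int i = 0; i < NTAPS - 1; i++)      /* shift partial sums */
          regs[i] = regs[i + 1] + taps[i + 1] * x;
      regs[NTAPS - 1] = 0;                     /* last stage has no
                                                  upstream register  */
      return y;
  }

  int main(void)
  {
      const int taps[NTAPS] = { 1, 3, 3, 1 };  /* made-up coefficients */
      int regs[NTAPS] = { 0 };
      int samples[] = { 1, 0, 0, 0, 2, 0, 0, 0 };

      for (unsigned i = 0; i < sizeof samples / sizeof *samples; i++)
          printf("%d ", fir_tick(samples[i], regs, taps));
      printf("\n");                            /* prints 1 3 3 1 2 6 6 2 */
      return 0;
  }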


The PPC core is nice when you want to have just ONE device in something, 
rather than a microcomputer AND an FPGA.  A typical application would be 
some sort of robotics thing where you run the nav algorithms on the PPC, 
while the interfaces to the motors and sensors live in the fabric.
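
On the PPC side, the fabric just looks like a few memory-mapped 
registers.  A rough sketch (the addresses and register layout here are 
completely hypothetical, and the control law is a stub):

  #include <stdint.h>

  /* Hypothetical register map: these addresses are invented; the real
     ones come from whatever address decode you build into the fabric. */
  #define ENCODER_COUNT_REG (*(volatile uint32_t *)0x80000000u)
  #define MOTOR_PWM_REG     (*(volatile uint32_t *)0x80000004u)

  /* Stand-in for the real control law. */
  static uint32_t compute_duty(uint32_t position)
  {
      return position & 0xffu;
  }

  /* One pass of the nav loop.  The quadrature decoding and PWM
     generation live in the FPGA fabric; from the PPC they just look
     like two registers. */
  void control_step(void)
  {
      uint32_t position = ENCODER_COUNT_REG;   /* fabric counts edges */
      MOTOR_PWM_REG = compute_duty(position);  /* fabric makes pulses */
  }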

>>2) Most large FPGAs have very high bandwidth links available 
>>(RocketIO for Xilinx, for instance), although they're hardly a 
>>commodity generic thing with well defined high level methods of use (e.g. 
>>not like using sockets).  You're hardly likely to rack and stack FPGAs.
>
>There are some system-on-a-chip + FPGA folks making this more interesting 
>and much easier.  Have a look at Stretch (www.stretchinc.com).  There are 
>others.  It's still not "plug and go", but it is getting better.

Stretch looks more like the "multiple transputers on a chip with 
programmable interconnects" sort of model.  This might actually be the way 
to go for a lot of high rate kinds of problems.  You've moved up one level 
of abstraction from raw gates (or blocks of IP).

Of course Stretch is pretty new, and their website, while detailed-looking, 
is fairly content-free.



>>3) Hardware failure probabilities are probably comparable between FPGAs 
>>and conventional CPUs.  However, you're hardly likely to get the 
>>economies of scale for FPGAs.
>
> From talking to some of the builders of these cards, run rates of 
> 100 cards are considered large.

That might even be huge.  So far, FPGAs tend to get used in products that 
are highly specialized (e.g. a video compressor box).  The generic FPGA 
boards (eval boards) are probably in the dozens per year volume (based on 
how long it takes to get one when you buy it).

>>There isn't a well-developed commodity mobo market for FPGAs, where you 
>>can just pick your FPGA and choose from among a half dozen boards that 
>>are all essentially identical functionally.
>
>Well, there are "some", but the FPGA world started in signal processing, so 
>lots of the "generic" boards you can choose from have lots of other stuff 
>you don't need.  And you are right, there is no standard interface to 
>memory or to IO.  This means that if you have a new version of the board, 
>you have to port to your new version.  Similar, to a degree, to flashing 
>a new BIOS; just different.

The boards that are available tend to be in the nature of "eval boards".. 
something you can buy for not too much money that allows you to develop 
your algorithms before committing to a real board design for your 
product.  To make it "end-user useful", you'd typically need quite a bit of 
glue and infrastructure.


>>What "generic" FPGA boards exist are probably in the multikilobuck area 
>>as well.
>
>I haven't seen too many "generic" boards for HPC.  Lots for DSP and related.

True enough.  The boards are starting to be more module-oriented, so you 
can buy just the FPGA and interface part, with the analog/digital stuff as 
modules that are added on.  But even so, it's hard to find something that 
isn't in the classic DSP model of 2 A/Ds, FPGA, 2 D/As..  (I was recently 
looking for an inexpensive board that could do 5 A/Ds... nope, no way.)




>>4) There are applications for which FPGAs excel, but it's likely that an 
>>FPGA solution is going to be very tailored to that particular problem, and 
>>not particularly useful for other problems.  The FPGA may 
>>be a perfectly generalized resource, but the system into which it is 
>>soldered is not likely to be so.
>>Joe's analogy to video coprocessors is apt.  Very specialized to a 
>>particular need, where they achieve spectacular performance, especially 
>>in an operations-per-dollar or operations-per-watt sense.  However, they 
>>are very difficult to apply in a generalized way to a variety of problems.
>>Of course, the video coprocessor is actually an ASIC, and is essentially 
>>hardwired for a particular algorithm or set of algorithms.  You don't see 
>>many video cards out there with an FPGA on them, for the reason that the 
>>price/performance would not be very attractive.
>
>The interesting thing is the way folks seeking higher-performance math and 
>pipeline processing have taken to using the shaders.  This suggests that for a 
>well enough designed system, with an effectively standardized way to 
>access the resources, there may be places where a 
>micro-accelerator/co-processor might add value to a code.  The early 
>x86-x87 pairs were like this.  An attached co-processor.  I don't think 
>you are going to want to try to implement a large general purpose 
>processor on an FPGA, as they don't have enough gates to be 
>interesting.  But a highly specialized co-processor that has some sort of 
>highly focused functionality has been useful in other areas.

The trick is in finding a set of operations that a large enough group of 
people will be interested in.  The FP coprocessor was easy.. everybody 
needs floating point, and the set of functionality is fairly easy to come up 
with.  But even there, there was a heap o' whining about the 287 or Weitek 
interfaces, incompatibility with some vendors, etc.  And before too 
long, the FPU was merged into the processor.


>>(mind you, if you've got the resources, and a suitable small set of 
>>problems you're interested in, developing ASICs is the way to go.
>>D.E.Shaw Research is doing just this for their computational chemistry 
>>problems.)
>
>ASICs cost more than FPGAs to develop, and you are committed to a 
>design.  ASICs will be much less expensive in high volumes.

And ASICs are faster and lower power than FPGA implementations.  Figure a 
couple million bucks to crank out a decent ASIC.  So, if you've got an 
application that will use 10,000 of them, that's $200 each in amortized 
NRE alone.  Once you get to thousand-widget scales, things like power 
become important.
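
The FPGA-vs-ASIC break-even is easy to eyeball.  A quick sketch, where 
all the dollar figures are made-up round numbers just to show the shape 
of the tradeoff:

  #include <stdio.h>

  int main(void)
  {
      /* Made-up round numbers, for illustration only. */
      double asic_nre  = 2e6;   /* mask set, tools, design effort */
      double asic_unit = 20.0;  /* per-chip cost in volume        */
      double fpga_nre  = 1e5;   /* dev boards, tools, engineering */
      double fpga_unit = 500.0; /* big FPGA at modest quantities  */

      /* Volume n where the total costs cross:
         asic_nre + n*asic_unit = fpga_nre + n*fpga_unit */
      double n = (asic_nre - fpga_nre) / (fpga_unit - asic_unit);
      printf("break-even at about %.0f units\n", n);  /* ~3958 */
      return 0;
  }

Below that volume the FPGA wins on total cost despite a 25x unit price; 
above it, the ASIC's NRE is amortized away and its lower unit cost (and 
power) win.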


>>5) FPGA development tools are wretchedly expensive, compared to the tools 
>>for "software". It's a more tedious, difficult and expensive development 
>>process.
>
>Actually, the compilers are getting reasonable (using Pathscale EKO as a 
>definition of reasonable); they aren't 100x or even 10x of that price, 
>more like the 2-5x region.  However, and this is important, the 
>compilers aren't as good (yet) as humans at this work.

I'd compare current FPGA tools to working with an old-style monolithic 
assembler and a big book of subroutine libraries that you can cut and 
paste into your source code.

Even relatively simple designs (say, 4 parallel phase-locked loops and an 
interface to a host) can take more than an hour to synthesize and compile 
into a bit stream ready for loading into the target.  For BIG designs, 
it's a matter of starting the build before you go home and hoping it 
doesn't throw an error before you get in in the morning.

>>There's a lot more software developers than FPGA designers out there, so 
>>it's harder to find people to do the work.  And any change means you 
>>resynthesize EVERYTHING, and then you have to revalidate all the timings, 
>>regenerate test vectors, etc.  There's no equivalent of "patching".
>
>In general yes.  Have a look at the Stretch bit.  It is quite 
>interesting.  They have some issues as well (their FPGA is small).  But it 
>is intriguing.

Yep.. the Stretch is interconnected processors, from what I can figure out, 
so it's naturally partitioned.  And maybe that's the key to this sort of 
thing.

I'd venture to say, though, that there are probably only a half dozen 
people in the world who know how to "program" the Stretch widget.  Much 
different from traditional software development, where it's almost at the 
stage where there are homeless people at freeway onramps with cardboard 
signs "Will C++ for food".  (No, I'm wrong, it's "will JavaScript for 
food".)

We might look to products like the transputer here.. really nifty, not too 
tough to program, blindingly fast, but eventually run over by the 
ever-increasing speed of the host processor.


>
>
>
>
>--
>Joseph Landman, Ph.D
>Founder and CEO
>Scalable Informatics LLC,
>email: landman at scalableinformatics.com
>web  : http://www.scalableinformatics.com
>phone: +1 734 786 8423
>fax  : +1 734 786 8452
>cell : +1 734 612 4615

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875



