[Beowulf] coprocessor to do "physics calculations"

Sun May 14 13:28:10 PDT 2006

Mark Hahn wrote:
>> Didn't see anyone post this link regarding the Ageia PhysX processor. It is the most comprehensive write up I have seen.
>>
>> http://www.blachford.info/computer/articles/PhysX1.html
> 
> yes, and even so it's not very helpful.  "fabric connecting compute and
> memory elements" pretty well covers it!  the block diagram they give
> could almost apply directly to Cell, for instance.
> 
> fundamentally, about these cell/ageia/gpu/fpga approaches,
> you have to ask:
> 
> 	- how cheap will it be in final, off-the-shelf systems?  GPUs
> 	are most attractive this way, since absurd gaming cards have 
> 	become a check-off even on corporate PCs (and thus high volume.)
> 	it's unclear to me whether Cell will go into any million-unit 
> 	products other than dedicated game consoles.

This will drive prices for the Cell way down.  Volume has a habit of 
helping do that.  FPGAs will likely remain several thousand dollars per 
unit (Virtex 4 and above) unless you can drive many units, in which case 
you have to start looking at the economics of ASICs if your algorithm 
never changes.  If you have frequently changing algorithms, or want to 
build a special processor per code, then you need the programmability of 
the FPGA.  For this to make sense from a price point of view, 
you have to look at the overall performance you get out of it.  Few 
people (I think) would be willing to pay $10kUSD for a 10x performance 
delta, though I would think that closer to a 100x delta, this price 
wouldn't be an issue.
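
To put rough numbers on that, here is a minimal Python sketch.  The 
function name is made up for illustration; the $10k board price and the 
10x/100x deltas are just the figures above, not measurements.

```python
def dollars_per_speedup(board_cost_usd, speedup):
    """Dollars paid for each 1x of speedup delivered by an accelerator board."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return board_cost_usd / speedup


# Illustrative figures only: a $10k board at a 10x and a 100x delta.
for speedup in (10, 100):
    cost = dollars_per_speedup(10_000, speedup)
    print(f"{speedup:>4}x for $10000 -> ${cost:,.0f} per 1x of speedup")
```

The same board is ten times cheaper per unit of speedup at 100x than at 
10x, which is roughly where the price stops being the objection.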

> 	- does it run efficiently-enough?  most sci/eng I see is pretty
> 	firmly based on 64b FP, often with large data.  but afaikt, 

Numerical stuff is pretty much DP FP right now.  I saw one of the FPGA 
GRAPE units running a stellar dynamics simulator at SC05.  If you are 
willing to give up IEEE754/854 for performance, you can do some pretty 
amazing things.

> 	Cell (eg) doesn't do well on anything but in-cache 32b FP.

The idea with Cell and pretty much all APUs (acceleration processing 
units) out there today is that you need to double buffer and constantly 
stream data in.  This limits which algorithms they can work on, though 
not terribly so.
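
A minimal sketch of the double-buffering pattern, in plain Python rather 
than anything Cell-specific.  load_chunk and process_chunk are 
hypothetical stand-ins for the DMA transfer and the compute kernel, and a 
single background thread plays the role of the DMA engine.

```python
from concurrent.futures import ThreadPoolExecutor


def load_chunk(data, start, size):
    """Hypothetical stand-in for a DMA transfer into local memory."""
    return data[start:start + size]


def process_chunk(chunk):
    """Hypothetical stand-in for the compute kernel: a sum of squares."""
    return sum(x * x for x in chunk)


def double_buffered_sum(data, chunk_size):
    """Compute on chunk i while chunk i+1 is being loaded in the background."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = 0
    pending = None  # future for the chunk that has been requested but not processed
    with ThreadPoolExecutor(max_workers=1) as loader:
        for start in range(0, len(data), chunk_size):
            # Kick off the next load before touching the current chunk.
            next_load = loader.submit(load_chunk, data, start, chunk_size)
            if pending is not None:
                total += process_chunk(pending.result())
            pending = next_load
        # Drain the last buffer.
        if pending is not None:
            total += process_chunk(pending.result())
    return total


print(double_buffered_sum(list(range(10)), 3))  # 285
```

The pattern only pays off when the work per chunk is large enough to hide 
the transfer time, which is the restriction on algorithms mentioned above.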

> 	GPUs have tantalizingly high local-mem bandwidth, but also 
> 	don't really do anything higher than 32b.

Single precision isn't so bad for many calculations.  You would be 
surprised how many of the auto companies run long crash simulations this 
way.  Considerations other than base data type accuracy can swamp the 
calculations.
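
To get a feel for how much single precision actually costs on a long 
running sum, here is a small sketch using only the Python standard 
library.  It emulates 32b arithmetic by rounding through struct after 
every add; the step value and iteration count are arbitrary choices for 
illustration, not taken from any real crash code.

```python
import struct


def to_f32(x):
    """Round a Python float (64b) to the nearest 32b float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step, n, single):
    """Add `step` to a running total n times, in 32b or 64b arithmetic."""
    if single:
        step = to_f32(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            # Rounding the 64b sum of two 32b values gives the same
            # result as a correctly rounded 32b add.
            total = to_f32(total)
    return total


n = 100_000
exact = n / 10  # what the sum of n steps of 0.1 should be
for label, single in (("32b", True), ("64b", False)):
    result = accumulate(0.1, n, single)
    print(f"{label}: sum = {result!r}, abs error = {abs(result - exact):.3e}")
```

Whether the 32b error matters depends on how it compares with the other 
error sources in the model (mesh, material data, time step), which is the 
judgment call being made in those crash runs.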

> 	- how much time will it take to adapt to the peculiar programming
> 	model necessary for the device?  during the time spent on that,
> 	what will happen to the general-purpose CPU market?

Yes.  This is why any APU must be easy to program.  Non-programmable 
APUs or minimally programmable units (fixed function units) are doomed 
to niches at best.  You need to be able to turn your codes around on it 
very quickly, in a time comparable to days, not months of Verilog/VHDL.

> I think price, performance and time-to-market are all stacked against this 
> approach, at least for academic/research HPC.  it would be different if the

I disagree.  On specific codes (possibly not FP heavy right now if we 
are talking about FPGAs), both the price/performance and the raw 
performance will be difficult to beat.  The time to market is 
critical.  Part of this is accelerator card design.  Part of it is ease 
of spinning new applications.  Application turnaround time cannot 
exceed something close to a month, or no one will do it.

For various informatics codes, you can get 100-300x type performance 
deltas (I have seen 300x reported in papers, others have reported 
higher).  If you can get 100x better performance by adding in a $10kUSD 
board, would you do it?

For chemistry codes and other FP heavy codes, you need a DP (64b) 
accelerator.  FPGAs don't make good DP FP units right now: IEEE754 is 
expensive in terms of gates, so you can't get enough units on there. 
The best I have heard of is the SRC MAP processor, which had something 
like 100 units running at 150 MHz and could just eke out 11 GFlops.  As 
this is comparable to a dual core Opteron, this is not the way you want 
to go for double precision floating point.  There are other options (now 
and coming on line).
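
A back-of-envelope check on those numbers, assuming purely for 
illustration that each unit retires one DP flop per cycle (the real 
per-unit rate is not something I have a spec for):

```python
def peak_gflops(units, clock_mhz, flops_per_cycle=1):
    """Theoretical peak in GFlops for `units` FP units at `clock_mhz`.

    flops_per_cycle=1 is an assumption, not a published figure.
    """
    return units * clock_mhz * flops_per_cycle / 1000


peak = peak_gflops(100, 150)   # approximate unit count and clock from above
sustained = 11.0               # approximate reported figure from above
print(f"peak      = {peak:.1f} GFlops")
print(f"sustained = {sustained:.1f} GFlops ({sustained / peak:.0%} of peak)")
```

Under that assumption, 11 GFlops is already a large fraction of what the 
part can deliver, so the gap is in unit count and clock rate, not tuning.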

> general-purpose CPU market stood still, or if there were no way to scale up
> existing clusters...

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452
cell : +1 734 612 4615