[Beowulf] gpgpu

Vincent Diepeveen diep at xs4all.nl
Tue Aug 26 10:16:09 PDT 2008


Well, doing DP calculations on a GPU is a bit of a waste of power and
programmer time for now, Geoff.

Remember, that GPU is a 250 watt TDP monster, versus a 45 nm Xeon
chip at 50 watt TDP.

How many units can do DP calculations? Like 40 or so.

According to my calculation that puts the practical limit at:
   0.675 GHz * 40 units * 2 DP per 128-bit unit = 80 * 0.675 = 54 gflop.

Sounds like a lot to you?

A CELL node does 150 (OK, granted, that's 2 CPUs), and I assume you
have libraries there that do the optimized calculations for you.

A quad-core 2.5 GHz, 50 watt TDP Xeon:
    2 ipc * 2 DP per 128-bit unit * 2.5 GHz * 4 cores = 40 gflop

A dual-socket box with two of those Xeons is therefore 80 gflops on paper.

So for hardware that is really tough to program, with 2 GB of RAM or
so, you get 54 gflop for something like 1000 euro per Tesla box.
Maybe add 20% import tax as well == 1200 euro a unit.

Compare: for 1200 euro you can buy a dual-Xeon 2.5 GHz box with 8 GB
of RAM or so, with a TDP of 2x50 == 100 watt, delivering 80 gflop,
and on it you can run the most optimized SSE2 codes that exist on
planet earth, whereas on the GPU you have to hire a good CUDA
programmer, or toy around yourself, to get that FFT done somehow.
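
To be fair, the CUDA toolkit does ship a CUFFT library, so a basic
complex FFT on the card is only a few calls; whether it beats
hand-tuned SSE2 code is another question. A minimal sketch, error
checking left out, the buffer and size names are just placeholders:

    #include <cufft.h>
    #include <cuda_runtime.h>

    /* Sketch: 1D complex-to-complex FFT of N points on the GPU using
       the CUFFT library that ships with the CUDA toolkit.
       Error checking omitted for brevity. */
    void gpu_fft(cufftComplex *host_in, cufftComplex *host_out, int N)
    {
        cufftComplex *dev_buf;
        cudaMalloc((void **)&dev_buf, sizeof(cufftComplex) * N);
        cudaMemcpy(dev_buf, host_in, sizeof(cufftComplex) * N,
                   cudaMemcpyHostToDevice);

        cufftHandle plan;
        cufftPlan1d(&plan, N, CUFFT_C2C, 1);      /* one batch of N points */
        cufftExecC2C(plan, dev_buf, dev_buf, CUFFT_FORWARD); /* in-place forward FFT */

        cudaMemcpy(host_out, dev_buf, sizeof(cufftComplex) * N,
                   cudaMemcpyDeviceToHost);
        cufftDestroy(plan);
        cudaFree(dev_buf);
    }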

Those CUDA programmers have major problems writing good complex  
codes, as they do not get any technical specs on the hardware
from Nvidia.

Or do you happen to know which hardware instructions the GPU has, and
what the latency to RAM is when all units read device memory
simultaneously?

You can write code in several ways, and each will trigger different
instructions.
Is shifting right the same speed as shifting left?
Can it do register rotates?
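
Without specs the only way to find out is to micro-benchmark it
yourself. A sketch of that: time a dependent chain of operations with
the on-chip clock() counter; the rotate here is emulated with two
shifts and an OR, since there is no documented rotate instruction to
test directly.

    /* Sketch: time a dependent chain of shift operations with the
       per-multiprocessor clock() counter. Launch with a single block. */
    __global__ void shift_timing(unsigned int *out, clock_t *cycles)
    {
        unsigned int x = threadIdx.x + 1;
        clock_t start = clock();
        #pragma unroll
        for (int i = 0; i < 256; i++)
            x = (x << 3) | (x >> 29);   /* emulated 32-bit rotate by 3 */
        clock_t stop = clock();
        out[threadIdx.x] = x;           /* keep x live so the loop isn't removed */
        if (threadIdx.x == 0)
            *cycles = stop - start;
    }

Run it once like this and once with a plain shift, divide the cycle
counts by 256, and you have your answer for that one chip; that is
about the only way to get it.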

In short you will lose a factor of 4 in speed somewhere in advance, as
you've got no clue how exactly to optimize for the hardware.

I've considered the GPU for prime numbers instead of game tree search,
but my FFT code is written in plain C, so already existing software
that is totally SSE2-optimized is kicking it major league.

For game tree programming, what matters is a technical spec on how
fast the 460 stream processors can simultaneously look up a random
cacheline (256 bytes, I assume) out of a hash table of 1 GB of memory.
Each stream processor looks up a random cacheline; there MIGHT be
overlap between lookups, but there doesn't need to be. At any given
moment at most 4% of the lookups can overlap, and that is an upper
bound; with a 2 GB buffer instead of 1 GB the odds of overlap are of
course already quite a tad smaller.
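
The kernel I would want timed is roughly the following: every thread
chases a chain of random indices into a big table in device memory, so
the measurement is dominated by the latency of scattered reads while
all multiprocessors hammer the RAM at once. The entry layout and the
constants here are just placeholders.

    /* Sketch: each thread does a chain of dependent random lookups in a
       large device-memory table, roughly what probing a hash
       (transposition) table looks like. 64M entries of 16 bytes == 1 GB. */
    #define TABLE_ENTRIES (64u * 1024u * 1024u)

    struct HashEntry { unsigned int key, lock, score, depth; }; /* 16 bytes */

    __global__ void probe_latency(const HashEntry *table, unsigned int *sink)
    {
        unsigned int idx = (blockIdx.x * blockDim.x + threadIdx.x) * 2654435761u;
        unsigned int acc = 0;
        for (int i = 0; i < 1024; i++) {
            HashEntry e = table[idx % TABLE_ENTRIES];
            acc ^= e.key;
            /* make the next index depend on the fetched data, so the
               loads cannot be overlapped away */
            idx = idx * 1664525u + 1013904223u + e.key;
        }
        sink[blockIdx.x * blockDim.x + threadIdx.x] = acc;
    }

Time that for a few thousand threads, divide, and you know what a
random probe really costs when every multiprocessor reads device
memory at the same time. That number is nowhere in the documentation.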

For nearly everything that can profit from 32 bits and is somewhat
mature code (so not the one-eyed in the land of the blind), a big
cache speeds up the calculation big time.

Without an answer to such a simple latency question I'm not even going
to take the effort to buy a card.

The 5 seconds of system time, at most, that I can run CUDA on this
laptop (an Nvidia 8600M GT with 32 stream processors) is not
interesting for serious testing.

Nvidia sells GPUs. For whatever reason you can also program them, but
the core business is selling GPUs.
GPUs basically need single precision floating point.

You can represent any coordinate using 2 single precision floats,
which is what some software has standardized on.
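
I take that to be the usual float-float ("double-single") trick: keep
a value as a high and a low single, so the pair carries roughly 48
mantissa bits. A minimal sketch of the addition step, which is the
standard Knuth/Dekker two-sum and nothing Nvidia-specific; it assumes
the compiler is not allowed to reassociate the float math:

    /* Sketch: a value is held as an unevaluated sum hi + lo of two
       singles. Standard double-single addition. */
    struct float2ds { float hi, lo; };

    __device__ float2ds ds_add(float2ds a, float2ds b)
    {
        float s = a.hi + b.hi;                    /* leading part of the sum */
        float v = s - a.hi;
        float e = (a.hi - (s - v)) + (b.hi - v);  /* rounding error of s */
        e += a.lo + b.lo;                         /* fold in the low parts */
        float2ds r;
        r.hi = s + e;                             /* renormalize */
        r.lo = e - (r.hi - s);
        return r;
    }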

So for all kinds of software that has to do with graphics, single
precision is enough.

It is therefore wishful thinking that you can buy a GPU and do your DP
calculations faster, or at a lower power cost, than on hardware from
IBM/Intel/AMD. Those calculations sell a lot of hardware for
IBM/Intel/AMD, millions of CPUs. Nvidia isn't selling millions of
GPGPU cards; they're selling graphics cards that you can additionally
also program.

There is very little software that can profit from the crunching power
it delivers. And you cannot let the cards run for too long: after 1.5
years a card is outdated and just eats too much power.

In short, there is a very limited number of applications that can be
ported to those GPGPUs.
That doesn't mean they're useless. On the contrary, they're very
helpful for *those* applications.

The big problem is the software. It is difficult to write software for
them, because the GPGPUs basically have one big contradiction: it's a
complete streaming environment, yet the amount of RAM you can stream
to/from is really limited.

So you have to stream within that limited amount of data, while you
don't have enough streaming bandwidth to the RAM either.
The claim is a stream bandwidth of 100 GB/s or so to the RAM.

Now that's fun of course, but the 460 processors can each retire 16
bytes per cycle. At 675 MHz that's:
    460 * 0.675 GHz * 16 bytes = roughly 5 terabytes per second.

So there is nearly a factor 50 gap between how such stream processors
would ideally be used (to stream; note their name) and what you can
effectively stream to the RAM.
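
Even that 100 GB/s claim I would want to measure rather than believe.
A trivial copy kernel plus event timing gives the effective number.
A sketch; the function names are mine:

    /* Sketch: measure effective device-memory streaming bandwidth in
       GB/s by copying n floats with one simple kernel. */
    __global__ void stream_copy(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    float measure_bandwidth_gb(const float *dev_in, float *dev_out, int n)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        stream_copy<<<(n + 255) / 256, 256>>>(dev_in, dev_out, n);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);

        /* n floats read plus n floats written */
        return (2.0f * n * sizeof(float)) / (ms * 1.0e6f);
    }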

The solution for that in most algorithms is to treat each processor
independently. You can then easily use the cache.

However, each stream processor cannot be treated independently. It is,
say, a block of 32 stream processors that you can treat independently.
That means those 32 must execute the same code and take the same
branches in the code.
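
In practice that means structuring the code so the branch condition is
uniform for a whole block, or at least a whole warp of 32. A small
sketch of the difference:

    /* Sketch: the same work written two ways. In the first kernel the
       threads of a warp take different branches, so both paths get
       executed one after the other; in the second the branch depends
       only on blockIdx, so a whole block follows the same path. */
    __global__ void divergent(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadIdx.x & 1)            /* odd/even threads diverge in a warp */
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
    }

    __global__ void uniform_branch(float *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (blockIdx.x & 1)             /* whole block takes the same path */
            data[i] = data[i] * 2.0f;
        else
            data[i] = data[i] + 1.0f;
    }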

Solving all that is a hell of an algorithmic job and a big challenge
for a guy like me.

Yet I have 0 contract jobs there currently. It seems no one wants to
pay for algorithmic solutions/ideas; they just steal them.

If I look around at those who do CUDA programming regularly, there is
sometimes just 1 small assignment that everyone goes for, say some
scientist who wants to pay 1000 dollars to get something done.

That's not very inspiring for GPGPU programming and the results that  
you hear about are not very inspiring at all.

Most of those attempts are more of the kind: "suppose we sell a
product and a user happens to have an Nvidia card in his box, then we
can also use that GPU as a 'coprocessor' to do some job that we would
otherwise let the main processor do".

So they're not doing it to make their code run faster as such. It's
just "profiting" from the fact that some users have an Nvidia GPU.

As for professional usage, where you really want to speed up codes
compared to an 8-core Xeon box, things are really complicated.

First of all you need to buy Tesla boxes to get some bandwidth to the
RAM, and those aren't cheap.
Suppose that after 1.5 years you have a company which *might* consider
giving you a contract job to port some sort of generic simulator
framework, and you ship an email to Nvidia for support, so that you
can at least reassure that company, which is already really reluctant
to bet on 1 specific vendor. Governments always overdo things on
paper; they want their software generic to avoid dependencies 10 years
from now, as those F22's (F35's, in fact) will be in service for a
really long time to come.

Nvidia will tell you that only AFTER you sign a deal with them for
zillions of boxes they might support it, and then only for a short
time.

That's not how reality works, and it simply guarantees that no
European government will seriously consider using those GPUs in a
systematic manner. All that happens is that 1 guy who is interested in
it does some study on it.

It is already difficult to show ON PAPER that a GPGPU solution offers
advantages over dual-socket boxes, let alone over the coming
quad-socket AMD boxes, knowing that software which works well on a
16-core AMD box will run even better on the next generation AMD or
Intel CSI box.

First SHOW someone that THEIR software framework's calculation gets
executed really fast for them, in some sort of 2-person pilot project,
and THEN maybe you get a contract job, whereby they buy the hardware
themselves directly from the manufacturers.

How many they buy and what type of simulations they run is none of my
business; I don't even want to know.

Porting code and existing working algorithms is quite easy compared to
inventing methods that make a kick-butt algorithm work under these
specific conditions and get the job done really efficiently.

IMHO the reason most people 'investigate' Nvidia's GPGPU is that we
can expect more architectures that focus on many tiny processors with
little cache to be released soon.

Vincent



On Aug 26, 2008, at 4:21 AM, Geoff Jacobs wrote:

> Vincent Diepeveen wrote:
>> Hi Mikhail,
>>
>> I'd say they're ok for black box 32 bits calculations that can do  
>> with a
>> GB or 2 RAM,
>> other than that they're just luxurious electric heating.
>
> Rather tangential, if I do say so...
>
> All the latest GPUs (GT200 and R7xx) do DP now.
>
> To answer the original question, AMD very definitely does SIMD.  
> Shaders
> are grouped in fives, each group being dispatched a single VLIW
> instruction which operates on all five. NVidia unified shaders are
> scalar, however.
>
>>
>> Vincent
>>
>> p.s. if you ask me, honestely, 250 watt or so for latest gpu is  
>> really
>> too much.
>> IMHO cpu's as well as gpu's should budget themselves more to a few  
>> tens
>> of watts at most.
>> Say 50 watt for the high end models.
>>
>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>
>>> BTW, why GPGPUs are considered as vector systems ?
>>> Taking into account that GPGPUs contain many (equal) execution  
>>> units,
>>> I think it might be not SIMD, but SPMD model. Or it depends from the
>>> software tools used (CUDA etc) ?
>>>
>>> Mikhail Kuzminsky
>>> Computer Assistance to Chemical Research Center
>>> Zelinsky Institute of Organic Chemistry
>>> Moscow
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>
>
> -- 
> Geoffrey D. Jacobs
>



