diep at xs4all.nl
Tue Aug 26 10:16:09 PDT 2008
Well, doing DP calculations on a GPU is a bit of a waste of power and
programmers for now, Geoff.
Remember, that GPU is a 250 watt TDP monster, whereas a 45 nm Xeon
chip is 50 watt TDP.
How many units can do DP calculations? Like 40 or so.
According to my calculation that gives a practical limit of:
0.675 GHz * 40 units * 2 DP each 128 bits = 80 * 0.675 = 54 gflop.
Sounds like a lot to you?
CELL node is doing 150 (ok granted that's 2 cpu's) and i assume you
have libraries that are optimized to do calculations for you.
Quadcore 2.5 GHz 50 watt TDP Xeon ==
2 ipc * 2 DP each 128 bits * 2.5 GHz * 4 cores = 40 gflop.
A dual quadcore Xeon box is therefore 80 gflops on paper.
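The paper-peak arithmetic above can be written down as a small sketch. The function name and the reading of the 80 gflop figure as two quad-core sockets are assumptions made for illustration; the clock, unit, and per-cycle figures are the ones quoted in this post, not measurements.

```python
def peak_dp_gflops(clock_ghz, units, dp_flops_per_unit_per_cycle):
    """Paper peak in double-precision gflop: clock * units * flops per cycle."""
    return clock_ghz * units * dp_flops_per_unit_per_cycle


# Figures as quoted in the post; none of them are measured.
gpu_peak = peak_dp_gflops(0.675, 40, 2)         # 40 DP units, 2 DP per 128 bits
xeon_quad_peak = peak_dp_gflops(2.5, 4, 2 * 2)  # 4 cores, 2 ipc * 2 DP per 128 bits
xeon_dual_peak = 2 * xeon_quad_peak             # assumption: two quad-core sockets

for name, value in [
    ("GPU", gpu_peak),
    ("quad-core Xeon", xeon_quad_peak),
    ("dual quad-core Xeon box", xeon_dual_peak),
]:
    print(f"{name}: {value:.0f} gflop on paper")
```

These are upper bounds only; real code reaches a fraction of them on either platform.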
So for hardware that is real tough to program for, with 2GB ram or so,
you get 54 gflop for something like 1000 euro per Tesla box.
Maybe add 20% import tax as well == 1200 euro a unit.
Compare: for 1200 euro you can buy a dual Xeon 2.5 GHz box with 8GB ram
or so, and it has a TDP of 2x50 == 100 watt,
delivering 80 gflop, and you can run the most optimized SSE2 codes
that exist on planet earth, whereas for the GPU you have to hire a good
CUDA programmer or toy around yourself to get that FFT done somehow.
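To make the comparison concrete, here is a rough sketch of gflop per watt and per euro using only the numbers quoted above. The `Box` class and its field names are hypothetical, and the watt figures are chip TDP only, not wall power for the whole machine.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One machine, described by the paper figures quoted in the post."""
    name: str
    peak_gflops: float  # paper peak, double precision
    tdp_watt: float     # chip TDP only, not whole-system power
    price_euro: float   # rough price including the assumed import tax

    def gflops_per_watt(self):
        return self.peak_gflops / self.tdp_watt

    def gflops_per_euro(self):
        return self.peak_gflops / self.price_euro


tesla = Box("Tesla box", 54.0, 250.0, 1200.0)
xeon = Box("dual Xeon box", 80.0, 100.0, 1200.0)

for box in (tesla, xeon):
    print(f"{box.name}: {box.gflops_per_watt():.3f} gflop/watt, "
          f"{box.gflops_per_euro():.3f} gflop/euro")
```

On these paper numbers the Xeon box comes out ahead on both measures, which is the point being argued; different prices or TDP figures would shift the ratios.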
Those CUDA programmers have major problems writing good complex
codes, as they do not get any technical specs on the hardware
Or do you happen to know which hardware instructions the GPU has and
how latency to RAM is when all units read device
You can write code in several manners, which will trigger other
Is shifting right and shifting left the same speed?
Can it do rotations of registers?
In short you will lose in advance a factor 4 in speed somewhere, as
you've got no clue how to exactly optimize for the
I've considered the GPU for prime numbers instead of game tree
search, but my FFT code is written in plain C,
and already existing software that is fully SSE2 optimized
beats it by a major-league margin.
For game tree programming you need technical specs on how fast the
460 stream processors can simultaneously each look up a random
cacheline (256 bytes i assume) out of a hashtable of 1GB memory.
So each stream processor is looking up a random cacheline; there
MIGHT be overlap, but there doesn't need to be. At most 4% of the
lookups can overlap at any given moment, which is an upper bound,
and when taking a 2GB buffer instead of 1GB the odds of that
happening are of course quite a tad tinier.
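One way to sanity-check the overlap estimate is a birthday-style calculation. This is a sketch under stated assumptions, not the calculation behind the 4% figure: lookups are taken as uniformly random and independent, "overlap" is read as at least two of the 460 lookups landing on the same cacheline, 1GB is taken as 2^30 bytes, and the function name is made up for illustration.

```python
def overlap_probability(lookups, table_bytes, line_bytes=256):
    """Chance that at least two of `lookups` uniformly random cacheline
    reads land on the same line of a table of `table_bytes` bytes."""
    lines = table_bytes // line_bytes
    p_all_distinct = 1.0
    for i in range(lookups):
        # The i-th lookup must avoid the i lines already taken.
        p_all_distinct *= (lines - i) / lines
    return 1.0 - p_all_distinct


GIB = 1 << 30  # assumption: 1GB means 2**30 bytes

p_1gb = overlap_probability(460, 1 * GIB)
p_2gb = overlap_probability(460, 2 * GIB)
print(f"1GB table: {p_1gb:.2%} chance of any overlap")
print(f"2GB table: {p_2gb:.2%} chance of any overlap")
```

Under these assumptions the chance of any overlap is a few percent for the 1GB table and roughly half that for 2GB, so nearly every lookup is a separate random memory access. That is why the latency question matters so much.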
For nearly everything that can profit from 32 bits and is a tad
mature code (so not one-eyed in the land of the blind),
a big cache speeds up calculations bigtime.
Without an answer to such a simple latency question i'm not even gonna
take the effort to buy a card.
The 5 seconds of system time i've got at maximum to run CUDA on this
laptop (nvidia 8600M GT inside with 32 stream processors)
are not interesting for serious testing.
Nvidia sells GPU's. For whatever reason you can program them. But the
core business is selling GPU's.
GPU's basically need single precision floating point.
You can get any coordinate using 2 single precision floating points,
which is what some software has standardized upon.
So for all kind of software that has to do with graphics, single
precision is enough.
It is therefore wishful thinking that you can buy a GPU and do your
DP calculations faster or at a lower power cost
than on hardware from IBM/Intel/AMD. Those calculations sell a lot
for IBM/Intel/AMD, millions of cpu's. Nvidia isn't selling
millions of gpgpu cards. They're selling graphics cards that
you can additionally also program.
There is very little software that can profit from the crunching
power it delivers. You cannot let the cards run too long.
After 1.5 years it is outdated and just eats too much power.
In short there is a very limited number of applications that can get
ported to those GPGPU's.
That doesn't mean they're useless. On the contrary, they're very
helpful for *those* applications.
The big problem is the software. It is difficult to make software for
them, because basically the GPGPU's have one big contradiction.
It's a complete stream environment yet the amount of RAM you can
stream to/from is real limited.
So within that limited data you need to stream, whereas you don't
have enough streaming bandwidth to the RAM.
There is a claim of a stream bandwidth of 100GB/s or so to the RAM.
Now that's fun of course, but the 460 processors can retire 16 bytes
per cycle. At 675 MHz that's:
460 * 0.675G * 16 bytes = 5 terabytes per second.
So there is nearly a factor 50 gap between how such stream
processors ideally get used (to stream, note their name)
and what you can effectively stream to the RAM.
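The factor of nearly 50 follows directly from the quoted figures. A minimal sketch, with a hypothetical helper name and the 100GB/s taken as the claimed, not measured, bandwidth:

```python
def demand_bytes_per_second(processors, clock_hz, bytes_per_cycle):
    """Bytes per second the processors could consume if each one retired
    `bytes_per_cycle` bytes on every clock cycle."""
    return processors * clock_hz * bytes_per_cycle


# Figures as quoted in the post.
demand = demand_bytes_per_second(460, 0.675e9, 16)
claimed_bandwidth = 100e9  # the claimed 100GB/s to device RAM

ratio = demand / claimed_bandwidth
print(f"demand:  {demand / 1e12:.2f} TB/s")
print(f"claimed: {claimed_bandwidth / 1e9:.0f} GB/s")
print(f"gap:     about {ratio:.0f}x")
```

So a kernel that really streams from RAM can keep only about one in fifty of those processor cycles fed. The rest has to come from data reuse in registers or cache.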
The solution for that in most algorithms is to treat each processor
independently. You can easily use the cache then.
However, each stream processor cannot be treated independently. Say a
block of 32 stream processors you can treat
independently. That means those must execute the same code and take
the same branches in the code.
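The cost of that lockstep rule can be illustrated with a toy model. This is a simplified sketch, not a description of the actual hardware: it assumes a block of 32 lanes that runs each side of a branch once if any lane needs it, with the other lanes idle meanwhile. The function name and the cycle costs are made up.

```python
def block_cycles(takes_branch, then_cost, else_cost):
    """Cycles for one lockstep block to get through an if/else.

    `takes_branch` holds one boolean per lane. If the lanes disagree,
    both sides run one after the other with part of the block idle.
    """
    any_then = any(takes_branch)
    any_else = not all(takes_branch)
    return (then_cost if any_then else 0) + (else_cost if any_else else 0)


LANES = 32
uniform = [True] * LANES                      # every lane takes the branch
divergent = [i % 2 == 0 for i in range(LANES)]  # lanes alternate

print("uniform:  ", block_cycles(uniform, then_cost=100, else_cost=80))
print("divergent:", block_cycles(divergent, then_cost=100, else_cost=80))
```

In this model a single lane that disagrees is enough to make the whole block pay for both sides. That is why branchy code such as game tree search is so hard to map onto these blocks.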
Solving all that is a hell of an algorithmic job and a big challenge
for a guy like me.
Yet i have 0 contract jobs there currently. Seems no one wants to pay
for algorithmic solutions/ideas. They just steal them.
If i look around at those who do CUDA programming regularly, there is
sometimes like 1 small assignment that all go for,
say some scientist who wants to give 1000 dollar to get something done.
That's not very inspiring for GPGPU programming and the results that
you hear about are not very inspiring at all.
Most of those attempts are more of the kind like: "suppose we sell a
product and by accident an user has a nvidia card in his
box, then we can also use that GPU as a 'co processor', to do some
job, that otherwise we let the main processor do".
So they're not doing that to let their code run faster or so. It's
just "profiting" from the fact that some users have a Nvidia gpu.
As for professional usage of it, where you really want to speed up
codes as compared to an 8 core Xeon box,
this is real complicated.
First of all you need to buy tesla boxes to get some bandwidth to the
RAM, and those aren't cheap.
If after 1.5 years you have a company which *might* consider giving
you a contract job to port some
sort of generic simulator framework and you ship an email to nvidia
so that at least you can tell that company, which already is really
reluctant to bet at 1 specific company.
Governments always overdo things on paper; they want their software
generic to avoid
dependencies 10 years from now as those F22's, F35 in fact, will be
in service for a real long time to come.
Nvidia will tell you that only AFTER you sign a deal with them to get
zillions of boxes of them,
they might support it and then only for a short time.
That's not how reality works, and it simply ensures that no European
government will seriously consider using those GPU's in a systematic
manner. All that happens is 1 guy who is interested in it
and is doing some study of it.
It is difficult already to show ON PAPER a GPGPU solution to offer
advantages over dual socket boxes,
let alone coming quad socket AMD boxes, knowing that your software
that works well at a 16 core AMD box,
will run even better at the next generation AMD or intel CSI box.
First SHOW someone that THEIR software framework's calculation gets
executed real fast for them, in some sort of 2 person
pilot project, and THEN maybe get a contract job, whereby they buy
themselves hardware directly from the manufacturers.
How many they buy and what type of simulations they run, that's all
none of my business; i don't even want to know.
Porting code and existing working algorithms is quite easy compared
to inventing methods that
get a kick butt algorithm to work under these specific conditions and
get the job done real efficiently.
IMHO the reason why most 'investigate' nvidia's GPGPU is because we
can expect more architectures to focus upon having
many tiny processors with little cache to get released soon.
On Aug 26, 2008, at 4:21 AM, Geoff Jacobs wrote:
> Vincent Diepeveen wrote:
>> Hi Mikhail,
>> I'd say they're ok for black box 32 bits calculations that can do
>> with a
>> GB or 2 RAM,
>> other than that they're just luxurious electric heating.
> Rather tangential, if I do say so...
> All the latest GPUs (GT200 and R7xx) do DP now.
> To answer the original question, AMD very definitely does SIMD.
> are grouped in fives, each group being dispatched a single VLIW
> instruction which operates on all five. NVidia unified shaders are
> scalar, however.
>> p.s. if you ask me, honestly, 250 watt or so for latest gpu is
>> too much.
>> IMHO cpu's as well as gpu's should budget themselves more to a few
>> of watts at most.
>> Say 50 watt for the high end models.
>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>> BTW, why GPGPUs are considered as vector systems ?
>>> Taking into account that GPGPUs contain many (equal) execution
>>> I think it might be not SIMD, but SPMD model. Or it depends from the
>>> software tools used (CUDA etc) ?
>>> Mikhail Kuzminsky
>>> Computer Assistance to Chemical Research Center
>>> Zelinsky Institute of Organic Chemistry
> Geoffrey D. Jacobs