NDAs Re: [Beowulf] Nvidia, cuda, tesla and... where's my double floating point?
diep at xs4all.nl
Tue Jun 17 14:28:24 PDT 2008
The big question we should raise, in order to answer your HPC market
question, is whether assembler coded x64 SSE2 code also counts.
Nothing as weird as all those bizarre SSE2 instructions. There is no
way you can make an objective analysis of why some weirdo
instructions are in and why some very useful stuff is not in.
Let's zoom in on one detail.
What we really need in SSE2, to speed up some FFT type transforms
bigtime and to make it attractive for all kinds of small codes, is
an instruction lowbitsmultiply. Say we represent a register as
2 x 64 bit integers, A:B, and we multiply it with C:D.
Then we want to be able to execute the next 2 instructions,
especially on Intel processors:
lowbitsmultiply : A:B * C:D ===> AC (mod 2^64) : BD (mod 2^64)
highbitsmultiply : A:B * C:D ===> AC >> 64 : BD >> 64
where '>> 64' means shift right by 64 bits, so you keep the upper
64 bits of each 128 bit product
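
As a rough sketch of the semantics meant above (not a real SSE2
instruction, just an emulation with Python integers), the two proposed
operations could look as follows. The function names are taken from the
description above; treating the lanes as unsigned 64 bit values and the
tuple representation of a register are assumptions made for illustration.

```python
MASK64 = (1 << 64) - 1  # keeps only the low 64 bits of a Python integer


def lowbitsmultiply(ab, cd):
    """Lane-wise low half: (A, B) * (C, D) -> (A*C mod 2^64, B*D mod 2^64)."""
    (a, b), (c, d) = ab, cd
    return ((a * c) & MASK64, (b * d) & MASK64)


def highbitsmultiply(ab, cd):
    """Lane-wise high half: (A, B) * (C, D) -> ((A*C) >> 64, (B*D) >> 64)."""
    (a, b), (c, d) = ab, cd
    return ((a * c) >> 64, (b * d) >> 64)


# Two "registers", each holding two unsigned 64 bit lanes (assumed layout).
x = (0xFFFFFFFFFFFFFFFF, 3)
y = (2, 5)

lo = lowbitsmultiply(x, y)
hi = highbitsmultiply(x, y)

# Gluing the high and low halves together gives the full 128 bit products.
full = tuple((h << 64) | l for h, l in zip(hi, lo))
```

The pair of results is what a modular multiply needs: the full 128 bit
product of each lane, split over two registers.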
Basically in floating point the Itanium could have such instructions
as the above,
so why not the PC processors also for 64 bits?
Of course we need these 2 instructions a tad faster than integer
multiplication currently is in the integer units.
We want a throughput of 1 cycle, without blocking other execution
units of course. When the SSE units get vectorized further, it is
of course not nice when a single instruction in one
of the execution units blocks all others.
There is no really perfect hardware from an HPC viewpoint. Very few
programmers are good at fiddling with SSE2 instructions to get their
thing done in a cryptic manner, and algorithms and caches haven't
become simpler since the 80s.
So i would argue that the number crunching market is already
dominated by SSE2 acceleration and will see even more of that
type of stuff, which basically means a return to the 80s in some
sense for programmers who want to get the maximum out of a CPU.
It gets interesting when Intel comes with something and AMD with yet
unother crunching koncept (YUCK),
making sure more cash gets pocketed by programmers.
So that is good news for the jobless low level programmers (me, says
the fool) :)
Note that if GPUs get equipped with this type of instruction, even
when just 32 bits,
that would already be a big step, as it allows CRT (Chinese
Remainder Theorem) to get the job done,
which is how it is currently solved on the PC.
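
To illustrate the CRT remark: a minimal sketch, assuming three pairwise
coprime moduli just below 2^32 (my own pick for illustration, not
anything from a real library), of how a big product can be computed
from multiplies that never need more than 32 x 32 -> 64 bits per lane.
The helper names are hypothetical.

```python
from math import gcd, prod

# Assumed moduli: three pairwise coprime values just below 2^32.
MODULI = (2**32 - 1, 2**32 - 3, 2**32 - 5)


def to_residues(x):
    """Split a big integer into one small residue per modulus."""
    return tuple(x % m for m in MODULI)


def mul_residues(xs, ys):
    # Each lane only needs a 32 x 32 -> 64 bit multiply plus a reduction.
    return tuple((x * y) % m for x, y, m in zip(xs, ys, MODULI))


def crt(residues):
    """Rebuild the unique value below prod(MODULI) from its residues."""
    total_mod = prod(MODULI)
    result = 0
    for r, m in zip(residues, MODULI):
        partial = total_mod // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8+).
        result += r * partial * pow(partial, -1, m)
    return result % total_mod


a = 123_456_789_012_345
b = 98_765_432_109_876

# The true product must stay below prod(MODULI), which holds here.
product = crt(mul_residues(to_residues(a), to_residues(b)))
```

The reconstruction step is the expensive part, which is why you want
to stay in residue form for as long as the algorithm allows.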
Point is that register size and the type of instructions your
hardware supports matter. It especially matters to publish which ones
you support. Video cards are simply lucky that our coordinate
system can easily be expressed in 32 bit floating point,
or even 16 bits when needed.
So i'd not wait for a GPU that is entirely 64 bits double precision
and/or 64 bits integers,
as that would slow them down for their biggest market.
Note that on paper it would be possible right now to make a
chess program run real fast within a GPU,
if the RAM accesses were latency optimized (so 2 random accesses
to the RAM by every stream processor every 10000 cycles,
the first access being a read, the second one a write). All
other theoretical problems i solved on paper.
Yet this small 'problem' dominates so much that i'm not
gonna gamble on programming it, only to find out perhaps
that my assumption is correct :)
However to reveal some of the math to compare the GPU's versus the
CPU's when i would design a new 'braindead'
Let's say we have 240 32 bit cores at 0.675 GHz
Let's assume now we need 10k cycles for each node at each
(1 nps = 1 node per second = 1 chessposition a second searching for
the holy grail)
I assume on average 1 instruction a cycle (dangerous assumption
considering there are also 2 RAM hits)
675000k cycles a second / 10k cycles a node = 67.5k nps per core
240 cores * 67.5k nps = 16.2 million nps
Now compare that with what i fight against in tournaments:
a Skulltrail mainboard with 2 Xeon cpu's overclocked to 4 GHz,
so 8 cores in total with big RAM.
The practical nps that fast chess programs get on it
is about 20 million nps.
PC faster than GPU.
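
The back-of-envelope math above can be redone in a few lines. This is
only a sketch using the numbers from this post (240 cores, 0.675 GHz,
10k cycles a node, about 20 million nps for the PC); the variable names
are mine and perfect scaling over all cores is assumed.

```python
GPU_CORES = 240
GPU_CLOCK_HZ = 675_000_000   # 0.675 GHz
CYCLES_PER_NODE = 10_000     # assumed cost of one node, 2 RAM hits included

# Nodes per second for one core, then for the whole GPU (perfect scaling).
nps_per_core = GPU_CLOCK_HZ // CYCLES_PER_NODE
gpu_nps = GPU_CORES * nps_per_core

# Practical figure quoted for the 8 core, 4 GHz PC.
PC_NPS = 20_000_000

ratio = PC_NPS / gpu_nps

print(f"GPU per core : {nps_per_core:,} nps")
print(f"GPU total    : {gpu_nps:,} nps")
print(f"PC / GPU     : {ratio:.2f}x")
```

Note that this is the optimistic case for the GPU, as it gets perfect
scaling here while the PC number is a measured one.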
I will skip all kinds of technical discussions, such as where the PC
loses: its memory controller is not nearly fast enough to serve every
node, so it loses bigtime
in the last plies, at least 20-30%; and that we assume the same
speedup for 8 cores
versus 240 cores, where game tree search is one of the hardest things
to parallelize,
so you effectively will lose a lot more at 240 cores than at 8 cores.
In fact i would guess
it's 30% efficiency for the GPU versus 87.5% for the PC. Yet there are
things to discover there
which make for an interesting challenge, so i skip that algorithmic
discussion entirely here.
In short, for problems that in the past were latency oriented, the
dominating factor is:
"how many instructions per second can you execute".
The PC simply beats the GPU here, and this for a typical 32 bits
problem (in case of my chess software).
The PC simply can effectively execute more instructions a cycle, when
also counting the instructions
it can skip by taking branches.
Note that the PC also wins on power consumption, at 4 GHz with 2 sockets.
It has taken me many months to redo the above math, trying to find a
solution to somehow get the GPU faster than the PC.
Didn't manage so far.
The biggest difference between a CPU and a GPU when doing such
math is the low clock of the GPU versus the
high clock of the CPU. At events the CPU already wins nearly a
factor 6 just based upon clock speed.
I showed up at a world championship in 2003 with a supercomputer
with 500 MHz cpu's and fought against opponents
at 2.8 GHz MP Xeon cpu's. Also a factor 6 difference nearly already in
That is really a big dent in your self-confidence.
Maybe it is good that Greg Lindahl didn't join in that event, he
would've googled a tad and shown up with that
one-liner of Seymour Cray.
On Jun 17, 2008, at 9:00 PM, richard.walsh at comcast.net wrote:
> -------------- Original message --------------
> From: Jim Lux <James.P.Lux at jpl.nasa.gov>
> > Well.. to be fair, there were (and still are) businesses out there
> > (particularly a few years ago) that didn't fully understand the
> > concept of needing net profit. (ah yes, the glory days of startups
> > "buying market share" in the dot-com bubble) And, some folks made a
> > fine living in the mean time. (But, then, those folks weren't the
> > owners, were they, or if they were, in a limited sense, they now
> > some decorative wallpaper..)
> Hey Jim,
> Gold rushes are good (and greed I guess too ... ;-) ...) ... there
> IS often gold in them there hills, it is just that very few if
> anyone knows exactly where. So, the less risk averse among us and
> those with more money than sense (thankfully, I say) starting
> digging. Most of their trials end in error, but the rest of us
> benefit from the few that are lucky/smart enough to find it. I
> think you are assuming that the futures are far more predictable
> than it in fact is, even for the best and brightest like
> yourself ... what percentage of the HPC market will accelerators
> have at this time next year?
> "Making predictions is hard, especially about the future."
> Niels Bohr
> Richard Walsh
> Thrashing River Consulting--
> 5605 Alameda St.
> Shoreview, MN 55126