NDAs Re: [Beowulf] Nvidia, cuda, tesla and... where's my double floating point?

Tue Jun 17 14:28:24 PDT 2008

Richard,

The big question we should raise in order to answer your HPC market
question is whether assembler-coded x64 SSE2 code also counts as
acceleration.
Nothing is as weird as all those bizarre SSE2 instructions. There is no
way to make an objective analysis of why some weirdo instructions are in
and why some very useful stuff is not.

Let's zoom in on one detail.

What we really need in SSE2, to speed up some FFT-type transforms
bigtime and to make it attractive for all kinds of small codes, is an
instruction lowbitsmultiply. Suppose we represent a register as 2 x 64
bit integers, A:B, and we multiply it with C:D.

Then we want to be able to execute the next 2 instructions,
especially on Intel processors:

lowbitsmultiply   :   A:B * C:D ===>   AC (mod 2^64) : BD (mod 2^64)
highbitsmultiply :   A:B * C:D ===>   AC >> 64       : BD >> 64
where AC and BD are the full 128-bit products and '>> 64' means shift
right by 64 bits
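
To make the semantics of the two wished-for instructions concrete, here is a minimal Python sketch. It only models the arithmetic; it is not an existing SSE2 instruction or intrinsic. The function names follow the names used above, the tuple representation of a register and the helper full_products are assumptions for illustration, and all lanes are assumed to be unsigned 64-bit values.

```python
MASK64 = (1 << 64) - 1  # keeps the low 64 bits of a value


def lowbitsmultiply(ab, cd):
    """Per lane, return the low 64 bits of the 128-bit product."""
    (a, b), (c, d) = ab, cd
    return ((a * c) & MASK64, (b * d) & MASK64)


def highbitsmultiply(ab, cd):
    """Per lane, return the high 64 bits of the 128-bit product."""
    (a, b), (c, d) = ab, cd
    return ((a * c) >> 64, (b * d) >> 64)


def full_products(ab, cd):
    """Hypothetical helper: glue both halves back into 128-bit products."""
    lo = lowbitsmultiply(ab, cd)
    hi = highbitsmultiply(ab, cd)
    return tuple((h << 64) | l for h, l in zip(hi, lo))


if __name__ == "__main__":
    x = (2**64 - 1, 3)   # register A:B
    y = (2**64 - 1, 5)   # register C:D
    print(lowbitsmultiply(x, y))
    print(highbitsmultiply(x, y))
```

Together the two results carry the complete product of each lane, which is what multi-word and modular arithmetic needs.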

Basically the Itanium could have instructions like the above in its
floating point unit, so why not the PC processors too, for 64 bits?

Of course we need these 2 instructions to be a tad faster than integer
multiplication currently is in the integer units.
We want a throughput of 1 per cycle without blocking other execution
units, of course. When further vectorizing the SSE units it is not nice
when a single instruction in one of the execution units blocks all the
others.

There is no really perfect hardware from an HPC viewpoint. Fiddling with
SSE2 instructions to get their thing done in a cryptic manner is
something very few programmers are good at, and algorithms and caches
haven't become simpler since the 80s.

So i would argue that the number crunching market is already dominated
by SSE2 acceleration and will see even more of that type of stuff, which
basically means a return to the 80s in some sense for programmers who
want to get the maximum out of a CPU.

It gets interesting now when Intel comes with something and AMD with yet
unother crunching koncept (YUCK),
ensuring that more cash gets pocketed by programmers.

So that is good news for the jobless low level programmers (me, says  
the fool) :)

Note that if GPUs got equipped with this type of instruction, even at
just 32 bits, that would already be a big step, as it allows CRT
(Chinese Remainder Theorem) to get the job done, which is one way it
currently gets solved on the PC.
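
As a rough illustration of the CRT route, the sketch below multiplies two numbers through their residues modulo three primes that each fit in 32 bits, then reconstructs the product. The choice of moduli (three primes commonly used for number-theoretic transforms) and all function names are assumptions of this sketch, not something taken from an existing GPU or PC code. Each per-modulus product of two residues fits in 64 bits, so a 32-bit low/high multiply pair would be enough to compute it.

```python
from math import prod

# Assumed moduli: three primes below 2^30, pairwise coprime.
MODULI = (998244353, 469762049, 167772161)


def to_residues(x):
    """Represent x by its remainders modulo each small modulus."""
    return tuple(x % m for m in MODULI)


def mul_residues(r, s):
    """Multiply lane by lane; every product of two residues is < 2^64."""
    return tuple((a * b) % m for a, b, m in zip(r, s, MODULI))


def crt(residues):
    """Reconstruct the unique value below prod(MODULI) from its residues."""
    big_m = prod(MODULI)
    x = 0
    for r, m in zip(residues, MODULI):
        m_i = big_m // m                     # product of the other moduli
        x += r * m_i * pow(m_i, -1, m)       # modular inverse, Python 3.8+
    return x % big_m


if __name__ == "__main__":
    a, b = 123456789012, 987654321098
    product = crt(mul_residues(to_residues(a), to_residues(b)))
    print(product == a * b)
```

The reconstruction is only exact while the true result stays below the product of the moduli, so larger results need more (or larger) moduli.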

Point is that register size and which type of instructions your
hardware supports matter. It especially matters to release which ones
you support. Videocards simply have the luck that our coordinate system
can be expressed easily in 32-bit floating point.

When needed even 16 bits.

So i'd not wait for a GPU that is entirely 64 bits double precision  
and/or 64 bits integers,
as that would slow them down for their biggest market.

Note that on paper it would be possible right now to make a chess
program run real fast within a GPU, if the RAM accesses were latency
optimized (so 2 random accesses to the RAM by every stream processor
per 10000 cycles, the first access being a read, the second one a
write). All other theoretical problems i solved on paper.

Yet this small 'problem' dominates so much that i'm not gonna gamble on
programming it, only to find out perhaps that my assumption is
correct :)

However to reveal some of the math to compare the GPU's versus the  
CPU's when i would design a new 'braindead'
chessprogram:

Let's say we have 240 32-bit cores at 0.675 GHz

Let's assume now we need 10k cycles for each node at each  
streamprocessor:
(1 nps = 1 node per second = 1 chessposition a second searching for  
the holy grail)
I assume on average 1 instruction a cycle (a dangerous assumption
considering there are 2 RAM hits also)

675000k  / 10k     = 67.5k nps
240 * 67.5k nps  = 16.2 million nps
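
The same back-of-the-envelope estimate as a small Python sketch, so the inputs can be varied. The function name and parameter names are made up for illustration; the model is only the one stated above (1 instruction per cycle, a fixed cycle budget per node, perfect scaling over all cores).

```python
def gpu_nps(cores, clock_hz, cycles_per_node):
    """Return (nps per core, total nps) under the 1-instruction-per-cycle model."""
    per_core = clock_hz // cycles_per_node   # nodes per second on one core
    return per_core, cores * per_core        # assumes perfect scaling


if __name__ == "__main__":
    per_core, total = gpu_nps(cores=240, clock_hz=675_000_000,
                              cycles_per_node=10_000)
    print(per_core, total)
```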

Now compare that with what i fight against in tournaments:
a Skulltrail mainboard with 2 Xeon CPUs overclocked to 4GHz,
so 8 cores in total with big RAM.

Looking at the practical nps that fast chess programs get on it right
now, that is about 20 million nps.

PC faster than GPU.

I will skip all kinds of technical discussions, such as where the PC
loses something: its memory controller is not nearly fast enough to
serve every node, so it loses bigtime in the last plies, at least
20-30%. I also skip that we assume the same speedup for 8 cores as for
240 cores, whereas game tree search is one of the hardest challenges to
parallelize, so you will effectively lose a lot more at 240 cores than
at 8 cores. In fact i would guess it's 30% efficiency for the GPU versus
87.5% for the PC. Yet there are things to discover there which make for
an interesting challenge, so i skip that algorithmic discussion entirely
here,
for now.
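
For a feel of what those guessed efficiencies would do to the comparison, here is a small sketch. Treating parallel efficiency as a plain multiplier on the nps figure is a simplification assumed here, not a claim about how search speedup really behaves; the names are illustrative and the input numbers are the ones from this post.

```python
from fractions import Fraction


def effective_nps(raw_nps, efficiency):
    """Scale a raw nps figure by a guessed parallel efficiency."""
    return raw_nps * efficiency


GPU_RAW = 240 * (675_000_000 // 10_000)   # 16.2 million nps on paper
PC_RAW = 20_000_000                       # practical figure for the 8-core box

gpu_eff = effective_nps(GPU_RAW, Fraction(30, 100))    # guessed 30%
pc_eff = effective_nps(PC_RAW, Fraction(875, 1000))    # guessed 87.5%

if __name__ == "__main__":
    print(int(gpu_eff), int(pc_eff), float(pc_eff / gpu_eff))
```

Under these guesses the gap between the two machines grows from roughly 1.2x to more than 3x in favour of the PC.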

In short, for problems that in the past were latency oriented, the
dominating factor is:
"how many instructions per second can you execute".

The PC simply wins it from the GPU here, this for a typical 32 bits  
problem (in case of my chess software).
The PC simply can effectively execute more instructions a cycle, when  
also counting the instructions
it can skip by taking branches.

Note that the PC also wins in power consumption at 4GHz @ 2 sockets.

It has taken me many months to redo the above math, trying to find a  
solution to get somehow GPU faster than PC.
Didn't manage so far.

The biggest difference between a CPU and a GPU when doing such math is
the low clock of the GPU versus the high clock of the CPU. The CPU at
events already wins nearly a factor 6 based on clock speed alone.

I showed up at a world championship in 2003 with a supercomputer with
500MHz CPUs and fought against opponents on 2.8GHz MP Xeon CPUs. That
was also nearly a factor 6 difference in clock speed already.

That is really a big dent in your self-confidence.

Maybe it is good that Greg Lindahl didn't join in that event, he  
would've googled a tad and showed up with that
one-liner of Seymour Cray.

Vincent

On Jun 17, 2008, at 9:00 PM, richard.walsh at comcast.net wrote:

>
> -------------- Original message --------------
> From: Jim Lux <James.P.Lux at jpl.nasa.gov>
>
> > Well.. to be fair, there were (and still are) businesses out there
> > (particularly a few years ago) that didn't fully understand the
> > concept of needing net profit. (ah yes, the glory days of startups
> > "buying market share" in the dot-com bubble) And, some folks made a
> > fine living in the mean time. (But, then, those folks weren't the
> > owners, were they, or if they were, in a limited sense, they now have
> > some decorative wallpaper..)
> >
>
> Hey Jim,
>
> Gold rushes are good (and greed I guess too ... ;-) ...) ... there
> IS often gold in them there hills, it is just that very few if
> anyone knows exactly where.  So, the less risk averse among us and
> those with more money than sense (thankfully, I say) start
> digging.  Most of their trials end in error, but the rest of us
> benefit from the few that are lucky/smart enough to find it.  I
> think you are assuming that the future is far more predictable
> than it in fact is, even for the best and brightest like
> yourself ... what percentage of the HPC market will accelerators
> have at this time next year?
>
> Regards,
>
> rbw
>
> -- 
>
> "Making predictions is hard, especially about the future."
>
> Niels Bohr
>
> -- 
>
> Richard Walsh
> Thrashing River Consulting--
> 5605 Alameda St.
> Shoreview, MN 55126
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf