NDAs Re: [Beowulf] Nvidia, cuda, tesla and... where's my double floating point?
diep at xs4all.nl
Tue Jun 17 10:07:54 PDT 2008
I feel you've noticed a very important thing here.
That is that a GPU is mainly interesting to program for hobbyists like
me, or for companies with a budget to buy less than a dozen
of them in total.
For ISPs the only thing that matters is power consumption, and for
encryption at a low TCP/IP layer it's too easy to equip all those
cheapo CPUs with encryption coprocessors which draw about 1 watt
and deliver enough work to keep the 100 Mbit / 1 Gbit NICs
fully busy; in the case of public key they in fact run at a speed that
you won't reach on a GPU even if you manage to parallelize it and get
it to work in a great manner. The ISPs of course look for fully
scalable machines, quite the opposite of having 1 card @ 250 watt.
In fact it would be quite interesting to know how fast you can run
RSA on a GPU. Where are the benchmarks there?
I tend to remember that I posted some solution to do a fast generic
modulo (of course not a new idea, but that is something you always
hear after figuring it out yourself), with a minimum of code, under the
condition that you already have multiplication code.
How fast can you multiply big numbers on those GPUs?
4096 x 4096 bits is most interesting there. Then of course take the
modulo quickly and repeat this for the entire exponentiation by squaring.
That is the only interesting question IMHO: what throughput
does it deliver for RSA-4096?
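The loop I mean is the textbook square-and-multiply exponentiation, with a reduction after every multiply so intermediates never grow past the modulus size. A minimal sketch in Python (Python's native bignums stand in for whatever GPU multiply/modulo code you already have):

```python
def modexp(base: int, exponent: int, modulus: int) -> int:
    """Left-to-right square-and-multiply: one squaring per exponent bit,
    one extra multiply per set bit, reducing mod the modulus each time."""
    result = 1
    base %= modulus
    for bit in bin(exponent)[2:]:                 # most-significant bit first
        result = (result * result) % modulus      # square, then reduce
        if bit == '1':
            result = (result * base) % modulus    # multiply, then reduce
    return result

# sanity check against Python's built-in modular exponentiation
assert modexp(12345, 65537, 2**127 - 1) == pow(12345, 65537, 2**127 - 1)
```

For RSA-4096 the exponent has up to 4096 bits, so the throughput question above boils down to how fast the card can do those ~4096 squarings plus reductions back to back.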
I tend to remember that a big handicap in such a case is that the older
cards (8800 etc.) can only do 16 x 16 bits == 32 bits,
whereas on CPUs you can use 64 x 64 bits == 128 bits. BIG difference.
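The difference is quadratic for schoolbook big-number multiplication: a 4096-bit operand is 256 limbs of 16 bits but only 64 limbs of 64 bits, and the schoolbook cost is limbs squared. A back-of-the-envelope check:

```python
# Partial-product count for schoolbook multiplication of two 4096-bit
# numbers, comparing the 16x16->32 multiply of the older GPUs with the
# 64x64->128 multiply available on 64-bit CPUs.
BITS = 4096

for limb_bits in (16, 64):
    limbs = BITS // limb_bits            # limbs per operand
    partial_products = limbs * limbs     # schoolbook cost
    print(f"{limb_bits}-bit limbs: {limbs} limbs, "
          f"{partial_products} partial products")

# 16-bit limbs: 256 limbs, 65536 partial products
# 64-bit limbs:  64 limbs,  4096 partial products -- 16x fewer
```

So even before clock speeds and parallelism enter the picture, the narrow multiplier costs 16x more multiply instructions per big-number product.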
Yet those hobbyists, who are the people most interested in GPU
programming, have limited time
to get software to run and have a budget far smaller than $10k.
They're not even gonna buy as many Teslas as NASA will.
Not a dozen.
The state GPU programming is in now is that some big companies
can have a person toy fulltime with 1 GPU,
as of course the idea of having a CPU with hundreds of cores is very
attractive and looks like a realistic future,
so companies must explore that future.
Of course every GPU/CPU company is serious in its aim to produce
products that perform well; we all do not doubt it.
Yet it is only attractive to hobbyists, and those hobbyists are not
gonna get the interesting technical data needed to get the maximum
out of the GPUs from Nvidia. This is a big problem. Those hobbyists
have very limited time to get their sparetime numbercrunching products
done, so being busy fulltime writing testprograms to learn everything
about 1 specific GPU is not something they all like to do for a hobby.
Just having that information will attract the hobbyists, as they are
willing to take the risk to buy 1 Tesla and spend time there. That
produces software. That software will show a certain performance, and
based upon that performance perhaps some companies might get interested.
Intel and AMD will be doing better there, I hope.
Note that testing CUDA is also suboptimal: a kernel just runs for 5
seconds or so max before the display watchdog kills it, so you need a
machine with a 2nd videocard.
That requires a mainboard with at least 2x PCI-e 16x. How to cluster
that? My cluster cards are PCI-X, not PCI-e: Quadrics QM400s.
I can get boards @ 139 euro with 1 PCI-e 16x slot and build quadcore
Q6600 nodes @ 500 euro, as soon as I have a job again.
My MacBook Pro 17'' has no PCI-e 16x slot free though.
So for number crunching, a cluster always wins from a single machine.
The communication over PCI-e to the videocards has too high a latency to
parallelize software that is not embarrassingly parallel.
The majority of hobbyists will have a similar problem with Nvidia; very
sad in itself.
A good CUDA setup that can beat a simplistic cluster is not as cheap
and easy to program for as building
that cluster is. Also, the memory scales better in those clusters than
it does for the cards. Even if 1 node can do less
work than 1 GPU can, it is still easier to get that
exponential speedup by having a shared cache across
all nodes (this is true for a lot of modern crunching algorithms).
With a GPU you're forced to do all calculation, including caching,
within the GPU and within the limited device RAM.
Now, in contradiction to what most people tend to believe, there
usually are methods to get away with a limited amount
of RAM using modern overwriting methods of caching; even when that
loses a factor of 2, there are ways to overcome that.
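By an overwriting cache I mean the always-replace hash tables familiar from crunching codes like chess engines: a fixed-size table where a new entry simply overwrites whatever hashed to the same slot, so RAM use stays bounded no matter how many items you see. A hypothetical minimal sketch (not any specific engine's scheme):

```python
class OverwriteCache:
    """Fixed-size cache with always-replace policy: a new entry
    overwrites the old occupant of its slot, so memory is bounded
    regardless of how many distinct keys are stored."""

    def __init__(self, slots: int):
        self.slots = slots
        self.keys = [None] * slots
        self.values = [None] * slots

    def store(self, key, value):
        i = hash(key) % self.slots
        self.keys[i] = key               # overwrite whatever was there
        self.values[i] = value

    def lookup(self, key):
        i = hash(key) % self.slots
        if self.keys[i] == key:
            return self.values[i]
        return None                      # never stored, or overwritten

cache = OverwriteCache(1024)
cache.store("position-A", 42)
```

A collision costs you a recomputation later (that is where the factor 2 can come from), but the table never grows, which is exactly what limited device RAM demands.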
The biggest limitation is that communication with other nodes is really
bad. Scaling to more nodes is just not gonna happen of course, as the
latency between the nodes is really bad, and an extra slow hop gets
added to the latency: first from device RAM to host RAM, then from host
RAM to the interconnect card, from the card to the remote host's RAM,
and from that RAM to the remote device RAM.
Let's make a list of problems that most clusters here calculate upon,
and you'll see how much the GPU concept still needs
to mature to work well for most codes.
Software that needs low latency interconnects you could therefore get
to work within 1 card only, provided the RAM access is not too slow.
Yet all reports so far indicate it is. The lack of information there is
just not very encouraging, and for professional crunching work, which
companies therefore have a big budget for, building or buying in your
own low-power coprocessor that so to speak even fits into an iPod is
just more attractive.
So in the end I guess some stupid extension of SSE will give a bigger
increase in crunching power than the in-itself-attractive GPGPU
approach, the biggest limitation being development time from hobbyists.
On Jun 17, 2008, at 4:01 PM, Jim Lux wrote:
> Quoting Linus Harling <linus at ussarna.se>, on Mon 16 Jun 2008
> 04:31:56 PM PDT:
>> Vincent Diepeveen skrev:
>>> Then instead of a $200 pci-e card, we needed to buy expensive
>>> Teslas for that, without getting
>>> very relevant indepth technical information on how to program for that
>>> type of hardware.
>>> The few trying on those Tesla's, though they won't ever post this as
>>> their job is fulltime GPU programming,
>>> report so far very disappointing numbers for applications that
>>> matter for our nations.
>> Tomography is kind of important to a lot of people:
>> But of course, that was done with regular $500 cards, not Teslas.
> Mind you, if you go and get a tomographic scan today, they already
> use fast hardware to do it. Only researchers on limited budgets
> tolerate taking days to reduce the data on a desktop PC. And, while
> the concept of doing faster processing with a <10KEuro box is
> attractive in that environment, I suspect it's a long way from
> being commercially viable in that role.
> The current tomographic technology (e.g. GE Lightspeed) is pretty
> impressive. They slide you in, and 10-15 seconds later, there's 3
> d rendered models and slices on the screen. The equipment is
> pretty hassle free, the UI straightforward from what I could see, etc.
> And, of course, people are willing (currently) to pay many millions
> for a machine to do this. I suspect that the other costs of
> running a CT scanner (both capital and operating) overwhelm the
> cost of the computing power, so going from a $100K box to a $20K
> box is a drop in the bucket. When you're talking MRI, for
> instance, there's the cost of the liquid helium for the magnets.
> That's a long way from a bunch of grad students racking up a bunch
> of PCs.