[Beowulf] GPU's - was Westmere EX

Vincent Diepeveen diep at xs4all.nl
Thu Apr 21 05:59:30 PDT 2011


Hi,

I was going through some old emails.

Note that in the meantime I have switched from AMD-CAL to OpenCL.

On Apr 8, 2011, at 3:03 AM, Gus Correa wrote:

> Thank you for the information about AMD-CAL and the AMD GPUs.
> Does AMD plan any GPU product with 64-bit and ECC,
> similar to Tesla/Fermi?

Actually GDDR5 already calculates a CRC. It is not as good as ECC, but
it does give you a form of error checking. Also, according to some
memory experts I spoke with, the number of bit flips is so small,
because the quality of this GDDR5 is so good, that this CRC is more
than sufficient.

As I'm not a memory expert, I would advise you to really speak with
such a person instead of some HPC guys here.

Now if your organisation simply requires ECC, I'm not going to argue.
A demand is a demand.

Here I'm busy figuring out, price-wise, how to build something cheap
that delivers a big punch.

If you look objectively at gpgpu codes, then of course Nvidia has a few
years more experience, having set up CUDA earlier.

Software support is another problem, of course. Both vendors are bad at
it, to put it politely.

Yet we want to do our calculations cheaply, don't we?

And if performance matters, AMD is a very cheap alternative.
In both cases, of course, programming for a GPU is going to be the
bottleneck; historically organisations do not invest in good code, they
only invest in hardware and in managers who sit on their behinds, drink
coffee and hold meetings.

Objectively, most codes can also be written in 32 bits.

For a simple comparison: the HD6990 is in the shops here for 540 euro.
That is a European price including 19% sales tax, so in the USA it is
probably cheaper (if you convert back to euros).

Let's ignore the marketing nonsense, OK, because marketing nonsense is
marketing nonsense. All those theoretical flops: they shouldn't be
allowed to double-count specific instructions like multiply-add.

The internals of these GPUs are all organized such that doing efficient
matrix calculations on them is very well possible. It is not easy to do
well, as the bottleneck will be the bandwidth from the DDR3 CPU RAM to
the GPU, yet for a lot of calculations it is algorithmically possible
to do far more work on the execution-unit side than the bandwidth you
need to another node. Those execution units, nowadays called PEs
(processing elements), have huge register files (GPRs) that can hold
all of that work. With that, those tiny, cheap, power-efficient cores
can easily take on huge, expensive CPU cores. In AMD's case a single
group of 4 PEs has a total of 1024 GPRs, can read from an L1 cache when
needed, and can write to a shared local memory of 32 KB (shared by 64
PEs).

That L1 in turn reads from the L2 and from memory, and all of that has
huge bandwidth.
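
A minimal OpenCL C sketch of the pattern I mean (illustrative only, not
code from any particular project; the kernel name and constants are
made up, and it assumes a work-group size of 64): load a word into
local memory once, then do a lot of register arithmetic before touching
global memory again.

    #define TILE  64    /* one value per work-item; assumes work-group size 64 */
    #define ITERS 256   /* arithmetic operations per loaded word (illustrative) */

    __kernel void compute_heavy(__global const uint *in, __global uint *out)
    {
        __local uint tile[TILE];

        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        tile[lid] = in[gid];                /* one global read per work-item   */
        barrier(CLK_LOCAL_MEM_FENCE);

        uint x = tile[lid];                 /* held in a register from here on */
        for (int i = 0; i < ITERS; i++)
            x = x * 2654435761u + tile[(lid + i) % TILE];  /* register + LDS work */

        out[gid] = x;                       /* one global write per work-item  */
    }

The point is simply that every word fetched from RAM feeds hundreds of
register operations, which is what lets those cheap PEs beat big CPU
cores.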

That gives you, in PRACTICE, 3072 PEs @ 0.83 GHz, which is
3072 * 0.83e9 ~= 2.5+ tera 32-bit integer operations per second. It's
not so hard to convert that to 64-bit code if that's what you need. In
fact I'm using it to work on huge integers (prime numbers) of
million-bit sizes (factorising them).
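
To make concrete what such a huge integer looks like on the card (an
illustrative sketch only, not my actual code): a million-bit number is
simply an array of roughly 31250 32-bit limbs, and the simplest
operations parallelise one limb per work-item, with carries resolved in
a later pass.

    /* Illustrative only: ~31250 limbs of 32 bits = 1,000,000 bits,
       least significant limb first. */
    __kernel void add_limbs(__global const uint *a, __global const uint *b,
                            __global uint *sum, __global uchar *carry,
                            uint nlimbs)
    {
        size_t i = get_global_id(0);
        if (i < nlimbs) {
            uint s = a[i] + b[i];
            sum[i]   = s;
            carry[i] = (s < a[i]);  /* carry into limb i+1, applied in a second pass */
        }
    }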

Using that efficiently is not easy, yet realize this is 2.5+ Tflop (I
should actually say 2.5+ tera 32-bit integer operations per second).
Good programmers can use today's GPUs very efficiently.

The 6000+ series of AMD and the Fermi series of Nvidia are very good  
and you can use them in a sustained manner.

Now the cheapest gpgpu card from Nvidia is about $1200, which is the
Quadro 6000, and it delivers 448 cores @ 1.2 GHz, say roughly 550 Gflop
(448 * 1.2 GHz is about 540 billion operations per second).

Of course this is what you can practically achieve; I'm not counting
multiply-add here as 2 flops, which is Nvidia's own definition of how
many gflops the card gets. First of all I'm not interested in flops but
in integer operations per cycle, and secondly I prefer a realistic
measure, otherwise we have no measure of how efficiently we use the
GPU.

From a mathematical viewpoint, it's not so clever of most scientists to
use floating point for today's huge calculations. Single precision or
double precision, in the end it all accumulates rounding errors, and
you can be quite sure you will get non-deterministic results.

Much better are integer transforms, where you have 100% lossless
calculations, so you can be quite sure your calculation is correct.
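
A small sketch of what lossless means here (an illustration only; the
prime is an arbitrary choice, not anybody's production code): residue
arithmetic modulo a prime is exact, so redoing the same calculation
gives bit-identical results, unlike floating point.

    #define P 2147483647u              /* 2^31 - 1; illustrative modulus choice */

    uint mulmod(uint a, uint b)        /* OpenCL C helper usable inside kernels */
    {
        ulong t = (ulong)a * (ulong)b; /* exact 64-bit product                  */
        return (uint)(t % P);          /* exact reduction: no rounding error    */
    }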

Yet I realize this is a very specialised field, with most people who
know something about it hiding behind secrecy and fake names, some even
staging fake burials just in order to disappear. That in itself is all
very sad, as science doesn't progress that way. As a result, the
scientific world has focused too much on floating point.

Yet the cards can deliver that as well, as we know.

The round-off errors that all those floating-point calculations cause
are always a huge multiple of the error from memory bit flips; it's not
even in the same league. Now of course, calculating with 64-bit
integers it's easier to do huge transforms, and you can redo your
calculation; in some spots you will then have deterministic output, in
others of course not (it depends on what you calculate; the majority is
non-deterministic).

With 32-bit integers you need a lot of CRT (Chinese Remainder Theorem)
tricks to use them effectively for huge transforms, or you simply
emulate 64-bit calculations (that is, with 64-bit integer precision;
please do not confuse this with double-precision floating point).
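
For what it's worth, a minimal OpenCL C sketch of that 32 x 32 -> 64
bit building block (the helper name is mine): the low and high halves
of the product come from two separate operations, which matches the "2
instructions in case of AMD" remark quoted below.

    uint2 mul32x32(uint a, uint b)
    {
        /* .x = low 32 bits, .y = high 32 bits of the 64-bit product */
        return (uint2)(a * b, mul_hi(a, b));
    }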

Getting all of that to work is very challenging and not easy, I realize
that.

Yet look at the huge advantage you give your scientists in that case.

They can look years ahead into the future, which is a *huge* advantage.

In this manner you'll actually, effectively, get 2.x Tflop out of that
6990. Again, that's 2 Tflop counted my way: I'm simply looking at the
INSTRUCTION LEVEL, where 1 instruction represents a single 32-bit
operation; counting the multiply-add instruction as 2 flops just
obscures how efficiently you manage to load your GPU, if you ask me.

In the transforms, in fact, multiply-add is very silly to use in many
cases, as needing it usually means you're doing some sort of
inefficient calculation.

Yet that chip is just 500 euro, whereas Nvidia delivers its card for
1200 dollars and the Nvidia one is a factor of 3 slower, though still
light-years faster than a CPU solution, price-wise.

The Quadro 6000, for those who don't realize it, is exactly the same as
a Tesla. Just check out the specs.

Yet of course for our lazy scientists all of the above is not so
interesting. Just compiling your 1980s code and pushing the enter
button is a lot easier.

If you care about PERFORMANCE, however, consider spending a couple of
thousand on hardware.

If you buy 1000 of those 6990s and program in OpenCL, you can actually
also run that code on Nvidia hardware, should Nvidia be so lucky as to
quickly release a 22 nm GPU some years from now. By then Nvidia's
OpenCL will probably also support their GPU hardware quite well.

So my advice would be: program it in OpenCL. It's not the most
efficient language on the planet, yet it will work everywhere, and you
can probably get around 2 Tflop out of that 6990 AMD card.
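
A hedged, minimal host-side sketch of why "works everywhere" holds
(error handling omitted, kernel and variable names illustrative): the
same kernel source string is compiled at run time for whatever GPU the
installed OpenCL driver exposes, whether that driver comes from AMD or
from Nvidia.

    #include <CL/cl.h>

    const char *src =
        "__kernel void scale(__global float *v, float a) {"
        "    v[get_global_id(0)] *= a;"
        "}";

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id   device;
        cl_int         err;

        clGetPlatformIDs(1, &platform, NULL);           /* AMD's or Nvidia's ICD */
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        /* The same source string builds on either vendor's implementation. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel kern = clCreateKernel(prog, "scale", &err);

        /* ... create buffers, set kernel arguments, enqueue, read back ... */
        return 0;
    }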

That said, of course there are still a zillion problems with OpenCL,
yet if you want to reach 1 petaflop for $500k in GPU hardware, you'll
have to suffer a bit, and by the time your cluster is there, possibly
all the big bugs in OpenCL will have been fixed, both by AMD and by
Nvidia, for their GPU lines.

Now, all this said, I do realize that you need a shift in thinking.
Whether you use AMD GPUs or Nvidia, in both cases you'll need great new
software. In fact it doesn't even matter whether you program it in
OpenCL or CUDA. It's easy to port algorithms from one to the other;
getting such an algorithm to work is a lot harder than the question of
what language you program it in.

Translating CUDA to OpenCL is pretty much braindead work which many
people can carry out, as we have already seen in some examples. The
investment is in the software for the GPUs.
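
As a sketch of how mechanical that translation is (an illustrative
kernel, not code from this thread): the CUDA index expression
blockIdx.x * blockDim.x + threadIdx.x becomes get_global_id(0),
__shared__ becomes __local, and __syncthreads() becomes
barrier(CLK_LOCAL_MEM_FENCE).

    __kernel void saxpy(__global const float *x, __global float *y, float a)
    {
        size_t i = get_global_id(0);  /* CUDA: blockIdx.x*blockDim.x + threadIdx.x */
        y[i] = a * x[i] + y[i];
    }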

You don't buy that in from Nvidia or AMD. You'll have to hire people to
program it, as your own scientists simply aren't good enough to program
efficiently for that GPU. The old-fashioned vision of having the
scientists work out themselves how to do the calculations is simply not
going to work for gpgpu.

Now that is a big pitfall that is hard to overcome.

All this said, of course there are a few, really very few, applications
where neither a full-blown GPU nor a hybrid solution is able to solve
the problem. Yet usually such a claim that it is "not possible" is made
by scientists who are experts in their own field, but not very strong
at finding ways to get their calculations done efficiently in HPC.

Regards,
Vincent

>
> The lack of a language standard may still be a hurdle here.
> I guess there were old postings here about CUDA and OpenGL.
> What fraction of the (non-gaming) GPU code is being written these days
> in CUDA, in AMD-CAL, and in OpenCL (if any), or perhaps using
> compiler directives like those in the PGI compilers?
>
> Thank you,
> Gus Correa
>
> Vincent Diepeveen wrote:
>>
>> On Apr 7, 2011, at 6:25 PM, Gus Correa wrote:
>>
>>> Vincent Diepeveen wrote:
>>>
>>>> GPU monster box, which is basically a few videocards inside such a
>>>> box stacked up a tad, wil only add a couple of
>>>> thousands.
>>>>
>>>
>>> This price may be OK for the videocard-class GPUs,
>>> but sounds underestimated, at least for Fermi Tesla.
>>
>> Tesla (448 cores @ 1.15 GHz, 3 GB GDDR5): $2,200
>> note there is a 6 GB version, not aware of price will be $$$$ i bet.
>> or AMD 6990 (3072 PE's @ 0.83Ghz, 4GB ddr5) : 519 euro
>>
>> VERSUS
>>
>> 8 socket Nehalem-ex, 512GB ram DDR3, basic configuration, $205k.
>>
>> Factor 100 difference to those cards.
>>
>> A couple of thousands versus a couple of hundreds of thousands.
>> Hope i made my point clear.
>>
>>
>>> Last I checked, a NVidia S2050 pizza box with four Fermi Tesla  
>>> C2050,
>>> with 448 cores and 3GB RAM per GPU, cost around $10k.
>>> For the beefed up version with with C2070 (6GB/GPU) it bumps to ~ 
>>> $15k.
>>> If you care about ECC, that's the price you pay, right?
>>
>> When fermi released it was a great gpu.
>>
>> Regrettably they lobotomized the gamers card's double precision as i
>> understand,
>> So it hardly has double precision capabilities; if you go for  
>> nvidia you
>> sure need a Tesla,
>> no question about it.
>>
>> As a company i would buy in 6990's though, they're a lot cheaper and
>> roughly 3x faster
>> than the Nvidia's (for some occasions more than 3x, for others less
>> than 3x; note the card
>> has 2 GPU's and 2 x 2GB == 4 GB ram on board so 2GB per gpu).
>>
>> 3072 cores @ 0.83Ghz with 50% of 'em 32 bits multiplication units  
>> for AMD
>> versus 448 cores nvidia with 448 execution units of 32 bits  
>> multiplication.
>>
>> Especially because multiplication has improved a lot.
>>
>> Already having written CUDA code some while ago, i wanted the cheap
>> gamers card with big
>> horse power now at home so  i'm toying on a 6970 now so will be  
>> able to
>> report to you what is possible to
>> achieve at that card with respect to prime numbers and such.
>>
>> I'm a bit amazed so little public initiatives write code for the  
>> AMD gpu's.
>>
>> Note that GDDR5 RAM doesn't have ECC by default, but in AMD's case it
>> has a CRC calculation (if I understand it correctly). It's a bit more
>> primitive than ECC, but it works pretty well and also shows you when
>> problems occurred, so figuring out what is going on is possible.
>>
>> Make no mistake that this isn't ECC.
>> We know some HPC centers have as a hard requirement ECC, only  
>> nvidia is
>> an alternative then.
>>
>> In earlier posts from some time ago and some years ago i already  
>> wrote
>> on that governments should
>> adapt more to how hardware develops rather than demand that  
>> hardware has
>> to follow them.
>>
>> HPC has too little cash to demand that from industry.
>>
>> OpenCL I cannot advise at this moment (for a number of reasons).
>>
>> AMD-CAL and CUDA are somewhat similar. Sure there is differences, but
>> majority of codes are possible
>> to port quite well (there is exceptions), or easy work arounds.
>>
>> Any company doing gpgpu i would advice developing both branches of  
>> code
>> at the same time,
>> as that gives the company a lot of extra choices for really very  
>> little
>> extra work. Maybe 1 coder,
>> and it always allows you to have the fastest setup run your  
>> production
>> code.
>>
>> That said we can safely expect that from raw performance coming years
>> AMD will keep the leading edge
>> from crunching viewpoint. Elsewhere i pointed out why.
>>
>> Even then i'd never bet at just 1 manufacturer. Go for both  
>> considering
>> the cheap price of it.
>>
>> For a lot of HPC centers the choice of nvidia will be an easy one, as
>> the price of the Fermi cards
>> is peanuts compared to the price rest of the system and considering
>> other demands that's what they'll go for.
>>
>> That might change once you stick in bunches of videocards in nodes.
>>
>> Please note that the gpu 'streamcores' or PE's whatever name you  
>> want to
>> give them, are so bloody fast,
>> that your code has to work within the PE's themselves and hardly  
>> use the
>> RAM.
>>
>> Both for Nvidia as well as AMD, the streamcores are so fast, that you
>> simply don't want to lose time on the RAM
>> when your software runs, let alone that you want to use huge RAM.
>>
>> Add to that, that nvidia (have to still figure out for AMD) can in
>> background stream from and to the gpu's RAM
>> from the CPU, so if you do really large calculations involving  
>> many nodes,
>> all that shouldn't be an issue in the first place.
>>
>> So if you really need 3 GB or 6 GB rather than 2 GB of RAM, that  
>> would
>> really amaze me, though i'm sure
>> there is cases where that happens. If we see however what was  
>> ordered it
>> mostly is the 3GB Tesla's,
>> at least on what has been reported, i have no global statistics on  
>> that...
>>
>> Now all choices are valid there, but even then we speak about peanuts
>> money compared to the price of
>> a single 8 socket Nehalem-ex box, which fully configured will be  
>> maybe
>> $300k-$400k or something?
>>
>> Whereas a set of 4x nvidia will be probably under $15k and 4x AMD  
>> 6990
>> is 2000 euro.
>>
>> There won't be 2 gpu nvidia's any soon because of the choice they  
>> have
>> historically made for the memory controllers.
>> See explanation of intel fanboy David Kanter for that at  
>> realworldtech
>> in a special article he wrote there.
>>
>> Please note i'm not judging AMD nor Nvidia, they have made their  
>> choices
>> based upon totally different
>> businessmodels i suspect and we must be happy we have this rich  
>> choice
>> right now between cpu's from different
>> manufacturers and gpu's from different manufacturers.
>>
>> Nvidia really seems to aim at supercomputers, giving their tesla line
>> without lobotomization and lobotomizing their
>> gamers cards, where AMD aims at gamers and their gamercards have full
>> functionality
>> without lobotomization.
>>
>> Total different businessmodels. Both have their advantages and
>> disadvantages.
>>
>> From a pure performance viewpoint it's easy to see what's faster, though.
>>
>> Yet right now i realize all too well that just too many still  
>> hesitate
>> between also offering gpu services additional to
>> cpu services, in which case having a gpu, regardless nvidia or amd,
>> kicks butt of course from throughput viewpoint.
>>
>> To be really honest with you guys, i had expected that by 2011 we  
>> would
>> have a gpu reaching far over 1 Teraflop double precision  
>> handsdown. If
>> we see that Nvidia delivers somewhere around 515 Gflop and AMD has 2
>> gpu's on a single card to get over that Teraflop double precision  
>> (claim
>> is 1.27 Teraflop double precision),
>> that really is underneath my expectations from a few years ago.
>>
>> Now of course i hope you realize i'm not coding double precision  
>> code at
>> all; i'm writing everything in integers of 32 bits for the AMD  
>> card and
>> the Nvidia equivalent also is using 32 bits integers. The ideal  
>> way to
>> do calculations on those cards, so also very big transforms, is using
>> the 32 x 32 == 64 bits instructions (that's 2 instructions in case  
>> of AMD).
>>
>> Regards,
>> Vincent
>>
>>
>>>
>>> Gus Correa
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin  
> Computing
> To change your subscription (digest mode or unsubscribe) visit  
> http://www.beowulf.org/mailman/listinfo/beowulf



