[Beowulf] gpgpu
Bruno Coutinho coutinho at dcc.ufmg.br
Fri Aug 29 08:04:18 PDT 2008
- Previous message: [Beowulf] gpgpu
- Next message: [Beowulf] gpgpu
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
For this problem:
http://arstechnica.com/news.ars/post/20080430-ps3s-cell-cpu-tops-high-performance-computing-benchmark.html
they reached at most 30% of peak performance on x86 processors. On Cell
and Niagara 2 they reached about 60% of peak. It seems that for
memory-intensive codes, a processor must have massive memory bandwidth
to get near peak performance.

2008/8/29 Mikhail Kuzminsky <kus at free.net>

> In message from "Li, Bo" <libo at buaa.edu.cn> (Fri, 29 Aug 2008
> 08:15:42 +0800):
>
>> Yes, Firestream has great paper performance, but how can you get it?
>> As for cost, if you don't mind using some non-professional
>> components, you can try their gaming cards, which are much cheaper.
>> We bought NVidia's previous flagship card, the 8800 Ultra, for 600
>> Euro, which was a crazy price, and now you can buy two GTX280s for
>> less. If you can live with SP, you will get 936 GFLOPS from each.
>> And we have achieved 40% of their peak performance, which sounds
>> good.
>
> But what percentage of peak can you get on an x86 CPU?
> If it's something like sgemm, then it doesn't look too attractive to
> me :-( : on ordinary x86 I can obtain about 90% of peak performance,
> plus the DP performance difference between Xeon/Opteron CPUs and a GPU
> is not too high :-(
> Mikhail
>
>> Regards,
>> Li, Bo
>> ----- Original Message -----
>> From: "Mikhail Kuzminsky" <kus at free.net>
>> To: "Li, Bo" <libo at buaa.edu.cn>
>> Cc: "Vincent Diepeveen" <diep at xs4all.nl>; <beowulf at beowulf.org>
>> Sent: Friday, August 29, 2008 1:52 AM
>> Subject: Re: [Beowulf] gpgpu
>>
>>> In message from "Li, Bo" <libo at buaa.edu.cn> (Thu, 28 Aug 2008
>>> 14:20:15 +0800):
>>>
>>>> ...
>>>> Currently, the DP performance of GPUs is not as good as we
>>>> expected, only 1/8 to 1/10 of the SP FLOPS. That is also a problem.
>>>
>>> AMD data: Firestream 9170 SP performance is 5 GFLOPS/W vs 1 GFLOPS/W
>>> for DP. So DP is 5 times slower than SP.
>>>
>>> Firestream 9250 has 1 TFLOPS for SP, therefore 1/5 is about 200
>>> GFLOPS DP. The price will be, I suppose, about $2000, as for the
>>> 9170.
>>>
>>> Let me look at a modern dual-socket quad-core Beowulf node with a
>>> price of about $4000+, for example. For the Opteron 2350/2 GHz chips
>>> (which I use), peak DP performance is 64 GFLOPS (8 cores). For 3 GHz
>>> Xeon chips it is about 100 GFLOPS.
>>>
>>> Therefore GPGPU peak DP performance is 1.5-2 times higher than with
>>> CPUs. Is that enough for a substantial calculation speedup, taking
>>> into account the time for data transmission to/from the GPU?
>>>
>>>> I would suggest hybrid computation platforms, with GPU, CPU, and
>>>> processors like Clearspeed. It may be a good topic for a
>>>> programming model.
>>>
>>> Clearspeed, if there is no new hardware now, does not have enough DP
>>> performance in comparison with typical modern servers on quad-core
>>> CPUs.
>>>
>>> Yours
>>> Mikhail
>>>
>>>> Regards,
>>>> Li, Bo
>>>> ----- Original Message -----
>>>> From: "Vincent Diepeveen" <diep at xs4all.nl>
>>>> To: "Li, Bo" <libo at buaa.edu.cn>
>>>> Cc: "Mikhail Kuzminsky" <kus at free.net>; "Beowulf" <beowulf at beowulf.org>
>>>> Sent: Thursday, August 28, 2008 12:22 AM
>>>> Subject: Re: [Beowulf] gpgpu
>>>>
>>>>> Hi Bo,
>>>>>
>>>>> Thanks for your message.
>>>>>
>>>>> What library do i call to find primes?
>>>>>
>>>>> Currently it's searching here for primes (PRP's) of the form
>>>>> p = (2^n + 1) / 3
>>>>>
>>>>> n is here about 1.5 million bits roughly as we speak.
>>>>>
>>>>> For SSE2 type processors there is the George Woltman assembler
>>>>> code (MiT) to do the squaring + implicit modulo;
>>>>> how do you plan to beat that type of really optimized number
>>>>> crunching on a GPU?
>>>>>
>>>>> You'll have to figure out a way to find an instruction level
>>>>> parallelism of at least 32,
>>>>> which also doesn't write to the same cacheline, i *guess* (no
>>>>> documentation to verify that in fact).
>>>>>
>>>>> So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes
>>>>>
>>>>> In fact the first problem to solve is to do some sort of squaring
>>>>> real quickly.
>>>>>
>>>>> If you figured that out on a PC, experience teaches you're still
>>>>> losing a potential factor of 8,
>>>>> thanks to another zillion optimizations.
>>>>>
>>>>> You're not allowed to lose a factor of 8. That 52 gflop a gpu can
>>>>> deliver on paper @ 250 watt TDP (you bet it will consume that
>>>>> when you let it work so hard) means the GPU effectively delivers
>>>>> less than 7 gflops double precision thanks to inefficient code.
>>>>>
>>>>> Additionally remember the P4. On paper the claim for integers at
>>>>> its release was that it would be able to execute 4 integer
>>>>> instructions a cycle; the reality is that it was a processor
>>>>> getting an IPC far under 1 for most integer codes. All kinds of
>>>>> stuff sucked on it.
>>>>>
>>>>> Experience teaches this is the same for today's GPU's: the
>>>>> scientists who have run codes on them so far, and are really
>>>>> experienced CUDA programmers, figured out the speed they deliver
>>>>> is a very big bummer.
>>>>>
>>>>> Additionally 250 watt TDP for massive number crunching is too
>>>>> much.
>>>>>
>>>>> It's well over a factor of 2 the power consumption of a quadcore.
>>>>> Now i can take a look soon in China myself at what power prices
>>>>> are over there, but i can assure you they will rise soon.
>>>>>
>>>>> Now that's a lot less than a quadcore delivers with a tdp far
>>>>> under 100 watt.
>>>>>
>>>>> Now i explicitly mention the n's i'm searching here, as it should
>>>>> fit within caches.
>>>>> So the very secret bandwidth you can achieve in practice (as we
>>>>> know nvidia lobotomized
>>>>> bandwidth in the GPU cards, only the Tesla type seems to be not
>>>>> lobotomized),
>>>>> i'm not even teasing you with that.
>>>>>
>>>>> This is true for any type of code. You're losing it in the
>>>>> details.
>>>>> Only custom tailored solutions will work,
>>>>> simply because they're factors faster.
>>>>>
>>>>> Thanks,
>>>>> Vincent
>>>>>
>>>>> On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:
>>>>>
>>>>>> Hello,
>>>>>> IMHO, it is better to call BLAS or a similar library rather than
>>>>>> programming your own functions. And CUDA treats the GPU as a
>>>>>> cluster, so .CU does not work like our normal codes. If you have
>>>>>> too many matrix or vector computations, it is better to use
>>>>>> Brook+/CAL, which can show the great power of the AMD gpu.
>>>>>> Regards,
>>>>>> Li, Bo
>>>>>> ----- Original Message -----
>>>>>> From: "Mikhail Kuzminsky" <kus at free.net>
>>>>>> To: "Vincent Diepeveen" <diep at xs4all.nl>
>>>>>> Cc: "Beowulf" <beowulf at beowulf.org>
>>>>>> Sent: Wednesday, August 27, 2008 2:35 AM
>>>>>> Subject: Re: [Beowulf] gpgpu
>>>>>>
>>>>>>> In message from Vincent Diepeveen <diep at xs4all.nl> (Tue, 26
>>>>>>> Aug 2008 00:30:30 +0200):
>>>>>>>
>>>>>>>> Hi Mikhail,
>>>>>>>>
>>>>>>>> I'd say they're ok for black box 32 bits calculations that can
>>>>>>>> do with a GB or 2 RAM,
>>>>>>>> other than that they're just luxurious electric heating.
>>>>>>>
>>>>>>> I also want to have a simple black box, but 64-bit (Tesla C1060
>>>>>>> or Firestream 9170 or 9250). Unfortunately life isn't
>>>>>>> restricted to BLAS/LAPACK/FFT :-)
>>>>>>>
>>>>>>> So I'll need to program something else. People say that the
>>>>>>> best choice is CUDA for Nvidia. When I look at the sgemm source,
>>>>>>> it has about 1 thousand (or more) lines in *.cu files.
>>>>>>> Therefore I think that a slightly more difficult algorithm,
>>>>>>> such as some special matrix diagonalization, will require a lot
>>>>>>> of programming work :-(.
>>>>>>>
>>>>>>> It's interesting that when I read the Firestream Brook+ "kernel
>>>>>>> function" source example for addition of 2 vectors ("Building a
>>>>>>> High Level Language Compiler For GPGPU",
>>>>>>> Bixia Zheng (bixia.zheng at amd.com)
>>>>>>> Derek Gladding (dereked.gladding at amd.com)
>>>>>>> Micah Villmow (micah.villmow at amd.com)
>>>>>>> June 8th, 2008)
>>>>>>> it looks SIMPLE. Maybe there are a lot of details/source lines
>>>>>>> which were omitted from this example?
>>>>>>>
>>>>>>>> Vincent
>>>>>>>> p.s. if you ask me, honestly, 250 watt or so for the latest gpu
>>>>>>>> is really too much.
>>>>>>>
>>>>>>> 250 W is TDP; the declared average value is about 160 W. I
>>>>>>> don't remember which GPU, from AMD or Nvidia, has a lot of
>>>>>>> special functional units for sin/cos/exp/etc. If they are not
>>>>>>> used, maybe the power will be a bit lower.
>>>>>>>
>>>>>>> As for Firestream 9250, AMD says about 150 W (although I'm not
>>>>>>> absolutely sure that it's TDP), which is the same as for some
>>>>>>> Intel Xeon quad-core chips with names beginning with X.
>>>>>>>
>>>>>>> Mikhail
>>>>>>>
>>>>>>>> On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:
>>>>>>>>
>>>>>>>>> BTW, why are GPGPUs considered vector systems?
>>>>>>>>> Taking into account that GPGPUs contain many (equal)
>>>>>>>>> execution units, I think it might be not the SIMD but the
>>>>>>>>> SPMD model. Or does it depend on the software tools used
>>>>>>>>> (CUDA etc)?
>>>>>>>>>
>>>>>>>>> Mikhail Kuzminsky
>>>>>>>>> Computer Assistance to Chemical Research Center
>>>>>>>>> Zelinsky Institute of Organic Chemistry
>>>>>>>>> Moscow
>>>>>>>>> _______________________________________________
>>>>>>>>> Beowulf mailing list, Beowulf at beowulf.org
>>>>>>>>> To change your subscription (digest mode or unsubscribe) visit
>>>>>>>>> http://www.beowulf.org/mailman/listinfo/beowulf
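The peak-versus-sustained arithmetic used throughout this thread can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the figure of 4 DP flops per core per cycle is an assumption (it reproduces the 64 GFLOPS quoted for the 8-core 2 GHz Opteron node, and gives 96 GFLOPS, i.e. "about 100", for the 3 GHz Xeon node). All other numbers are the ones quoted in the messages above.

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak: cores x clock (GHz) x flops per core per cycle."""
    return cores * ghz * flops_per_cycle


def sustained_gflops(peak, fraction_of_peak):
    """Throughput actually delivered when a code reaches a fraction of peak."""
    return peak * fraction_of_peak


# Assumption: 4 double-precision flops per core per cycle on both CPUs.
opteron_dp = peak_gflops(cores=8, ghz=2.0, flops_per_cycle=4)
xeon_dp = peak_gflops(cores=8, ghz=3.0, flops_per_cycle=4)

# Firestream 9250: 1 TFLOPS SP, DP taken as 1/5 of SP (as in the thread).
firestream_dp = 1000.0 / 5

# GTX280: 936 GFLOPS SP peak, 40% of peak reported as achieved.
gtx280_sp_sustained = sustained_gflops(936.0, 0.40)

if __name__ == "__main__":
    print(f"Opteron node peak DP:  {opteron_dp:.1f} GFLOPS")
    print(f"Xeon node peak DP:     {xeon_dp:.1f} GFLOPS")
    print(f"Firestream 9250 DP:    {firestream_dp:.1f} GFLOPS")
    print(f"GPU/Opteron peak DP:   {firestream_dp / opteron_dp:.2f}x")
    print(f"GPU/Xeon peak DP:      {firestream_dp / xeon_dp:.2f}x")
    print(f"GTX280 sustained SP:   {gtx280_sp_sustained:.1f} GFLOPS")
```

The same `sustained_gflops` helper applies to the 30% (x86) and 60% (Cell, Niagara 2) figures from the linked benchmark once a peak value for each processor is supplied; those peaks are not given in the thread, so they are left out here.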
