<div dir="ltr">In this problem:<br><a href="http://arstechnica.com/news.ars/post/20080430-ps3s-cell-cpu-tops-high-performance-computing-benchmark.html">http://arstechnica.com/news.ars/post/20080430-ps3s-cell-cpu-tops-high-performance-computing-benchmark.html</a><br>

<br>They obtained at most 30% of peak performance on x86 processors. On Cell and Niagara2 they obtained about 60% of peak performance.<br>It seems that for memory-intensive codes, the processor must have massive memory bandwidth to get near peak performance.<br>

<br><br><div class="gmail_quote">2008/8/29 Mikhail Kuzminsky <span dir="ltr"><<a href="mailto:kus@free.net">kus@free.net</a>></span><br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

In message from "Li, Bo" <<a href="mailto:libo@buaa.edu.cn" target="_blank">libo@buaa.edu.cn</a>> (Fri, 29 Aug 2008 08:15:42 +0800):<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Yes, Firestream has great performance on paper, but how can you actually achieve it?<br>

As for cost, if you don't mind using non-professional components, you can try their gaming cards, which are much cheaper. We bought NVidia's previous flagship card, the 8800 Ultra, for 600 Euro, which is a crazy price, and now you can buy two GTX 280s for less. If you can live with SP, you will get 936 GFLOPS from each. And we have achieved 40% of their peak performance, which sounds good.<br>


</blockquote>

<br>

But what percentage of peak can you get on an x86 CPU?<br>

If it's something like sgemm, then it doesn't look too attractive to me :-( :<br>

on an ordinary x86 I can obtain about 90% of peak performance, and the DP performance difference between Xeon/Opteron CPUs and a GPU is not too high :-( <br>

Mikhail<br>

<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Regards,<br>

Li, Bo<br>

----- Original Message ----- From: "Mikhail Kuzminsky" <<a href="mailto:kus@free.net" target="_blank">kus@free.net</a>><br>

To: "Li, Bo" <<a href="mailto:libo@buaa.edu.cn" target="_blank">libo@buaa.edu.cn</a>><br>

Cc: "Vincent Diepeveen" <<a href="mailto:diep@xs4all.nl" target="_blank">diep@xs4all.nl</a>>; <<a href="mailto:beowulf@beowulf.org" target="_blank">beowulf@beowulf.org</a>><br>

Sent: Friday, August 29, 2008 1:52 AM<br>

Subject: Re: [Beowulf] gpgpu<br>

<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

In message from "Li, Bo" <<a href="mailto:libo@buaa.edu.cn" target="_blank">libo@buaa.edu.cn</a>> (Thu, 28 Aug 2008 14:20:15 +0800):<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

...<br>

Currently, the DP performance of GPUs is not as good as we expected: only 1/8 to 1/10 of the SP FLOPS. That is also a problem.<br>

</blockquote>

<br>

AMD data: Firestream 9170 SP performance is 5 GFLOPS/W vs 1 GFLOPS/W for DP, so DP is 5 times slower than SP.<br>

<br>

Firestream 9250 has 1 TFLOPS for SP, therefore 1/5 of that is about 200 GFLOPS DP. The price will be, I suppose, about $2000, as for the 9170.<br>

<br>

Let me look at a modern dual-socket quad-core beowulf node priced at about $4000+, for example. For the Opteron 2350/2 GHz chips (which I use) peak DP performance is 64 GFLOPS (8 cores). For 3 GHz Xeon chips it is about 100 GFLOPS. <br>

Therefore GPGPU peak DP performance is 1.5-2 times higher than that of the CPUs.<br>

Is that enough for a substantial speedup of calculations, taking into account the time for data transfer to/from the GPU?    <br>
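A rough way to frame the question above: peak GFLOPS is cores x clock x flops per cycle, and any GPU speedup is eroded by the time spent moving data. The sketch below assumes 4 DP flops per cycle per core (consistent with the 64 GFLOPS quoted for 8 cores at 2 GHz) and a hypothetical transfer cost expressed as a fraction of the CPU-only run time. Both function names are made up for illustration.<br>
<pre>
```python
def peak_gflops(cores, ghz, flops_per_cycle=4):
    """Theoretical peak: cores * clock (GHz) * flops per cycle per core.

    flops_per_cycle=4 is an assumption matching the 64 GFLOPS figure
    quoted for 8 cores at 2 GHz.
    """
    return cores * ghz * flops_per_cycle


def effective_speedup(peak_ratio, transfer_fraction):
    """Net speedup when the compute part runs peak_ratio times faster on
    the GPU, but host/device transfers cost transfer_fraction of the
    CPU-only run time (a hypothetical, workload-dependent number)."""
    return 1.0 / (1.0 / peak_ratio + transfer_fraction)


if __name__ == "__main__":
    # Dual-socket quad-core Opteron 2350 at 2 GHz.
    print(peak_gflops(8, 2.0))
    # Peak ratio of 2 with an assumed 25% transfer overhead.
    print(effective_speedup(2.0, 0.25))
```
</pre>
Under these assumptions, a peak ratio of 2 with transfers costing a quarter of the CPU run time leaves a net gain of only about 1.33x, and at half the CPU run time the gain vanishes entirely.<br>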

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

I would suggest hybrid computation platforms, with GPU, CPU, and processors like Clearspeed. It may be a good topic for a programming model.<br>

</blockquote>

<br>

Clearspeed, unless there is new hardware by now, does not have enough DP performance compared with typical modern servers based on quad-core CPUs.<br>

<br>

Yours<br>

Mikhail <br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Regards,<br>

Li, Bo<br>

----- Original Message ----- From: "Vincent Diepeveen" <<a href="mailto:diep@xs4all.nl" target="_blank">diep@xs4all.nl</a>><br>

To: "Li, Bo" <<a href="mailto:libo@buaa.edu.cn" target="_blank">libo@buaa.edu.cn</a>><br>

Cc: "Mikhail Kuzminsky" <<a href="mailto:kus@free.net" target="_blank">kus@free.net</a>>; "Beowulf" <<a href="mailto:beowulf@beowulf.org" target="_blank">beowulf@beowulf.org</a>><br>

Sent: Thursday, August 28, 2008 12:22 AM<br>

Subject: Re: [Beowulf] gpgpu<br>

<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Hi Bo,<br>

<br>

Thanks for your message.<br>

<br>

What library do i call to find primes?<br>

<br>

Currently it's searching here for primes (PRPs) of the form p = (2^n + 1) / 3<br>

<br>

n here is roughly 1.5 million bits as we speak.<br>
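For readers unfamiliar with PRP testing of numbers of this form, here is a minimal sketch using Python's built-in big integers. It is not the code used in the search described here (which relies on heavily optimized squaring); the function names and the choice of base 3 are assumptions. Base 2 would be uninformative, since every number of this form with prime n passes a base-2 Fermat test. The base-3 test is also meaningless for n = 3 (where p = 3), so the example starts at n = 5.<br>
<pre>
```python
def wagstaff(n):
    """Return (2^n + 1) / 3 for a positive odd n (the division is exact)."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that 3 divides 2^n + 1")
    return (2 ** n + 1) // 3


def is_fermat_prp(p, base=3):
    """Fermat probable-prime test: does base^(p-1) mod p equal 1?

    Passing means 'probably prime', not 'proven prime'.
    """
    return pow(base, p - 1, p) == 1


if __name__ == "__main__":
    for n in (5, 7, 11, 13, 29):
        print(n, is_fermat_prp(wagstaff(n)))
```
</pre>
The whole cost sits in the repeated modular squarings inside pow, which is exactly the operation the hand-tuned SSE2 code accelerates for exponents in the million-bit range.<br>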

<br>

For SSE2-type processors there is George Woltman's assembler code (MiT) to do the squaring + implicit modulo;<br>

how do you plan to beat that kind of really optimized number crunching on a GPU?<br>

<br>

You'll have to figure out a way to find instruction-level parallelism of at least 32,<br>

which also doesn't write to the same cacheline, i *guess* (there is no documentation to actually verify that).<br>

<br>

So that's a range of 256 * 32 = 2^8 * 2^5 = 2^13 = 8192 bytes<br>

<br>

In fact the first problem to solve is to do some sort of squaring  real quickly.<br>

<br>

Once you have figured that out on a PC, experience teaches that you're still losing a potential factor of 8,<br>

thanks to another zillion optimizations.<br>

<br>

You can't afford to lose a factor of 8. The 52 gflops a GPU can deliver on paper @ 250 watt TDP (you bet it will consume that when you let it work so hard) means the GPU effectively delivers less than 7 gflops double precision, thanks to inefficient code.<br>
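The arithmetic behind that estimate, as a small sketch. The inputs are the figures quoted above (52 gflops on paper, a lost factor of 8, 250 watt TDP); the function names are hypothetical.<br>
<pre>
```python
def effective_gflops(paper_gflops, lost_factor):
    """Delivered throughput after losing lost_factor to unoptimized code."""
    return paper_gflops / lost_factor


def gflops_per_watt(gflops, watts):
    """Energy efficiency at a given power draw."""
    return gflops / watts


if __name__ == "__main__":
    eff = effective_gflops(52.0, 8.0)
    print(eff)                           # delivered DP gflops
    print(gflops_per_watt(eff, 250.0))   # delivered DP gflops per watt
```
</pre>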

<br>

Additionally, remember the P4. On paper, the claim at its release was that it would be able to execute 4 integer instructions a cycle; the reality is that it was a processor getting an IPC far under 1 for most integer codes. All kinds of stuff sucked on it.<br>

<br>

Experience teaches that it is the same for today's GPUs: the scientists who have run codes on them so far, and who are really experienced CUDA programmers, found that the speed they deliver is a very big bummer.<br>

<br>

Additionally 250 watt TDP for massive number crunching is too much.<br>

<br>

It's well over a factor of 2 above the power consumption of a quadcore. I can soon take a look myself at what power prices are in China, but i can assure you they will rise soon.<br>

<br>

Now that's a lot less than a quadcore delivers with a TDP far under 100 watt.<br>

<br>

Now i explicitly mention the n's i'm searching here, as they should fit within the caches.<br>

So the very secret bandwidth you can achieve in practice (as we know, nvidia lobotomized bandwidth in the GPU cards; only the Tesla type seems not to be lobotomized),<br>

i'm not even teasing you with that.<br>

<br>

This is true for any type of code. You're losing it to the details. <br>

Only custom tailored solutions will work,<br>

simply because they're factors faster.<br>

<br>

Thanks,<br>

Vincent<br>

<br>

On Aug 27, 2008, at 2:50 AM, Li, Bo wrote:<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Hello,<br>

IMHO, it is better to call BLAS or a similar library rather than programming your own functions. And CUDA treats the GPU as a cluster, so .CU code does not work like our normal codes. If you have a lot of matrix or vector computations, it is better to use Brook+/CAL, which can show the great power of AMD GPUs.<br>

Regards,<br>

Li, Bo<br>

----- Original Message -----<br>

From: "Mikhail Kuzminsky" <<a href="mailto:kus@free.net" target="_blank">kus@free.net</a>><br>

To: "Vincent Diepeveen" <<a href="mailto:diep@xs4all.nl" target="_blank">diep@xs4all.nl</a>><br>

Cc: "Beowulf" <<a href="mailto:beowulf@beowulf.org" target="_blank">beowulf@beowulf.org</a>><br>

Sent: Wednesday, August 27, 2008 2:35 AM<br>

Subject: Re: [Beowulf] gpgpu<br>

<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

In message from Vincent Diepeveen <<a href="mailto:diep@xs4all.nl" target="_blank">diep@xs4all.nl</a>> (Tue, 26 Aug 2008 00:30:30 +0200):<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Hi Mikhail,<br>

<br>

I'd say they're ok for black box 32-bit calculations that can do with a GB or 2 of RAM;<br>

other than that they're just luxurious electric heating.<br>

</blockquote>

<br>

I also want to have a simple black box, but a 64-bit one (Tesla C1060 or Firestream 9170 or 9250). Unfortunately, life isn't restricted to BLAS/LAPACK/FFT :-)<br>

<br>

So I'll need to program something else. People say that the best choice for Nvidia is CUDA. When I look at the sgemm source, it has about a thousand (or more) lines in *.cu files. Therefore I think that a somewhat more difficult algorithm, such as some special matrix diagonalization, will require a lot of programming work :-(.<br>

<br>

It's interesting that when I read the Firestream Brook+ "kernel function" source example for the addition of 2 vectors ("Building a High Level Language Compiler For GPGPU",<br>

Bixia Zheng (<a href="mailto:bixia.zheng@amd.com" target="_blank">bixia.zheng@amd.com</a>)<br>

Derek Gladding (<a href="mailto:dereked.gladding@amd.com" target="_blank">dereked.gladding@amd.com</a>)<br>

Micah Villmow (<a href="mailto:micah.villmow@amd.com" target="_blank">micah.villmow@amd.com</a>)<br>

June 8th, 2008)<br>

<br>

- it looks SIMPLE. Maybe a lot of details/source lines were omitted from this example?<br>

<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

Vincent<br>

p.s. if you ask me, honestly, 250 watt or so for the latest gpu is really too much.<br>

</blockquote>

<br>

250 W is the TDP; the declared average value is about 160 W. I don't remember which GPU, from AMD or Nvidia, has a lot of special functional units for sin/cos/exp/etc. If they are not used, maybe the power will be a bit lower still.<br>

<br>

As for the Firestream 9250, AMD says about 150 W (although I'm not absolutely sure that it's TDP), which is the same as for some Intel Xeon quad-core chips w/names beginning with X.<br>

<br>

Mikhail<br>

<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

On Aug 23, 2008, at 10:31 PM, Mikhail Kuzminsky wrote:<br>

<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

BTW, why are GPGPUs considered vector systems?<br>

Taking into account that GPGPUs contain many (identical) execution units,<br>

I think it might be not the SIMD but the SPMD model. Or does it depend on the software tools used (CUDA etc)?<br>

<br>

Mikhail Kuzminsky<br>

Computer Assistance to Chemical Research Center<br>

Zelinsky Institute of Organic Chemistry<br>

Moscow<br>

_______________________________________________<br>

Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a><br>

To change your subscription (digest mode or unsubscribe) visit<br>

<a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>

<br>

</blockquote>

<br>

</blockquote>

<br>


</blockquote>

<br>

<br>

</blockquote>

<br>

</blockquote></blockquote>

<br>

</blockquote></blockquote>

<br>


</blockquote></div><br></div>