<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Calibri, sans-serif; ">

<div>

<div>

<div><br>

</div>

<div>ASICs will beat FPGAs on power every day: the feature size is smaller, and you don't waste real estate (or leakage current) on gates that aren't used. Even with one-time-programmable FPGAs (like the Actel AX series), you are still applying power to the gates that aren't part of the current configuration. Even the biggest FPGAs are smaller (in terms of transistor count) than modern processors, whether general purpose or GPU.</div>
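<div><br>
</div>
<div>To make the leakage argument concrete, here is a toy sketch in Python. It assumes static power simply scales with the number of powered gates, and every number in it (gate counts, per-gate leakage) is a made-up placeholder, not data for any real FPGA or ASIC.</div>
<pre>
```python
# Toy model of static (leakage) power. All numbers are illustrative
# placeholders, not measurements of any real FPGA or ASIC.

def static_power_watts(powered_gates, leakage_watts_per_gate):
    """Static power scales with the gates that are powered, used or not."""
    return powered_gates * leakage_watts_per_gate

GATES_NEEDED = 2_000_000          # gates the design actually uses (assumed)
FPGA_GATES_POWERED = 10_000_000   # whole FPGA fabric stays powered (assumed)
LEAKAGE_PER_GATE = 1e-8           # watts per gate, same for both (assumed)

# The ASIC only contains the gates the design needs; the FPGA powers them all.
asic = static_power_watts(GATES_NEEDED, LEAKAGE_PER_GATE)
fpga = static_power_watts(FPGA_GATES_POWERED, LEAKAGE_PER_GATE)

print(f"ASIC static power: {asic:.3f} W")
print(f"FPGA static power: {fpga:.3f} W")
print(f"FPGA/ASIC ratio:   {fpga / asic:.1f}x")
```
</pre>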

<div><br>

</div>

<div>GPUs are like general purpose CPUs in the sense that most of the logic is fixed and the "program" controls what they do, so they are more like ASICs (actually, in one sense they ARE ASICs).</div>

<div><br>

</div>

<div>The difference with an FPGA is in the cost of a design spin. I can reprogram the configuration of a Xilinx Virtex 5 in a few milliseconds and probe its internal state through the JTAG port; if the logic is wrong, I can change a line of Verilog or VHDL, resynthesize in a few minutes, and try again. Spinning a new rev of a GPU or ASIC is a multimillion dollar exercise.</div>

<div>

<div>


</div>

</div>

</div>

</div>

<div><br>

</div>

<span id="OLK_SRC_BODY_SECTION">

<div style="font-family:Calibri; font-size:11pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt">

<span style="font-weight:bold">From: </span>Reiner Hartenstein <<a href="mailto:hartenst@rhrk.uni-kl.de">hartenst@rhrk.uni-kl.de</a>><br>

<span style="font-weight:bold">Reply-To: </span>"<a href="mailto:reiner@hartenstein.de">reiner@hartenstein.de</a>" <<a href="mailto:reiner@hartenstein.de">reiner@hartenstein.de</a>><br>

<span style="font-weight:bold">Date: </span>Saturday, March 16, 2013 12:11 AM<br>

<span style="font-weight:bold">To: </span>"<a href="mailto:beowulf@beowulf.org">beowulf@beowulf.org</a>" <<a href="mailto:beowulf@beowulf.org">beowulf@beowulf.org</a>><br>

<span style="font-weight:bold">Subject: </span>Re: [Beowulf] The GPU power envelope ------ Re: Beowulf Digest, Vol 109, Issue 22 ----<br>

</div>

<div><br>

</div>

<div>

<div text="#000000" bgcolor="#FFFFFF">

<div class="moz-cite-prefix"><br>

<div class="moz-signature">

<title>Reiner Hartenstein</title>

<p style="margin-top: 0; margin-bottom: 0"><br>

<font color="#000000"><span style="font-family: Arial;">Which accelerators have the better power envelope: GPUs or FPGAs?<br>

</span></font></p>

<p style="margin-top: 0; margin-bottom: 0"><font color="#000000"><span style="font-family: Arial;"><br>

<br>

</span></font></p>

<p style="margin-top: 0; margin-bottom: 0"><font color="#000000"><span style="font-family: Arial;">Best regards,</span><font face="Arial"><br>

</font><span style="font-family: Arial;">Reiner<br>

</span></font></p>

<p style="margin-top: 0; margin-bottom: 0"><font color="#000000"><span style="font-family: Arial;"><br>

</span><font face="Arial"></font></font></p>

<br>

</div>

On 15.03.2013 20:00, <a class="moz-txt-link-abbreviated" href="mailto:beowulf-request@beowulf.org">beowulf-request@beowulf.org</a> wrote:<br>

</div>

<blockquote cite="mid:mailman.1.1363374001.5342.beowulf@beowulf.org" type="cite">

<pre wrap="">

Today's Topics:

   1. Re: The GPU power envelope (was difference between

      accelerators) (Lux, Jim (337C))

----------------------------------------------------------------------

Message: 1

Date: Fri, 15 Mar 2013 03:52:25 +0000

From: "Lux, Jim (337C)" <a class="moz-txt-link-rfc2396E" href="mailto:james.p.lux@jpl.nasa.gov"><james.p.lux@jpl.nasa.gov></a>

Subject: Re: [Beowulf] The GPU power envelope (was difference between

        accelerators)

To: Beowulf List <a class="moz-txt-link-rfc2396E" href="mailto:beowulf@beowulf.org"><beowulf@beowulf.org></a>

Message-ID: <a class="moz-txt-link-rfc2396E" href="mailto:CD67E739.2E0F8%james.p.lux@jpl.nasa.gov"><CD67E739.2E0F8%james.p.lux@jpl.nasa.gov></a>

Content-Type: text/plain; charset="us-ascii"

I think what you've got here is basically the idea that things that are

closer consume less power and cost less, because you don't have the

"interface cost".

An FPU sitting on the bus with the integer ALU inside the chip has minimum

overhead: no going on and off chip with the associated level shifting, no

charging and discharging of transmission lines, etc.

A coprocessor sitting on the bus with the CPU is a bit worse. The

connection has to go off chip, so you have to change voltage levels and

physically charge and discharge a longer trace/transmission line.

A graphics card on a PCI bus has not only the on/off chip transition, it

has more than one, because the PCI interface also goes through that. More

capacitance to charge and discharge, too.

A second node connected with some wideband interconnect, but in a

different box...

You get the idea.

This is why people are VERY interested in on-chip optical transmitters and

receivers (e.g. things like VCSELs and APDs).  You could envision a

processor with an array of transmitters and receivers to create

point-to-point links to other processors within the field of view.  Only

one "change of media".
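
As a rough illustration of the "interface cost" idea, the sketch below uses

a lumped-capacitance approximation (energy per charge/discharge cycle of

about C * V^2). The capacitances and voltage swings are hypothetical round

numbers picked only to show the scaling, not figures for any real bus.

```python
# Rough lumped-capacitance model of signalling energy. The capacitances and
# voltage swings below are hypothetical round numbers chosen for illustration.

def energy_per_cycle_joules(capacitance_farads, swing_volts):
    """Energy drawn from the supply for one charge/discharge cycle: C * V^2."""
    return capacitance_farads * swing_volts ** 2

# name: (capacitance in farads, voltage swing in volts), all assumed
links = {
    "on-chip wire": (0.1e-12, 1.0),
    "off-chip board trace": (10e-12, 3.3),
}

for name, (c, v) in links.items():
    e = energy_per_cycle_joules(c, v)
    print(f"{name}: {e * 1e12:.2f} pJ per cycle")
```

Under these assumed values the off-chip hop costs far more energy per

transition, because both the capacitance and the voltage swing grow.
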

On 3/14/13 4:29 AM, "Vincent Diepeveen" <a class="moz-txt-link-rfc2396E" href="mailto:diep@xs4all.nl"><diep@xs4all.nl></a> wrote:

</pre>

<blockquote type="cite">

<pre wrap="">On Mar 12, 2013, at 5:45 AM, Mark Hahn wrote:

</pre>

<blockquote type="cite">

<pre wrap=""></pre>

<blockquote type="cite">

<blockquote type="cite">

<pre wrap="">I think HSA is potentially interesting for HPC, too.  I really expect

AMD and/or Intel to ship products this year that have a C/GPU chip

mounted on the same interposer as some high-bandwidth ram.

</pre>

</blockquote>

<pre wrap="">How can an integrated gpu outperform a gpgpu card?

</pre>

</blockquote>

<pre wrap="">if you want dedicated gpu computation, a gpu card is ideal.

obviously, integrated GPUs reduce the PCIe latency overhead,

and/or have an advantage in directly accessing host memory.

I'm merely pointing out that the market has already transitioned to

putting integrated gpus - the vote on this is closed.

the real question is what direction the onboard gpu takes:

how integrated it becomes with the cpu, and how it will take

advantage of upcoming 2.5d-stacked in-package dram.

</pre>

</blockquote>

<pre wrap="">Integrated GPUs will of course always have a very limited power budget.

So a gpgpu card with the same generation of GPU from the same

manufacturer, but with a bigger power envelope, is always going to be

10x faster of course.

If you got 10 computers with 10 APUs, even for a small price, you

would still need an expensive network and switch to handle them, so

that's 10 ports. At roughly 1000 dollars a port, that's $10k extra, and

that assumes your massive supercomputer doesn't run into bandwidth

trouble further up; otherwise your network cost suddenly becomes $3000

a port instead of $2k a port, with a factor of 10 more ports.

That's always going to lose moneywise to a single gpgpu card

that's 10x faster.

Whether it's Xeon Phi version X or Nvidia Kx0X, it's always going to

be 10x faster of course and 10x cheaper for massive supercomputing.

</pre>

<blockquote type="cite">

<pre wrap=""></pre>

<blockquote type="cite">

<pre wrap="">Something like 25 watts versus 250 watts: which will be faster?

</pre>

</blockquote>

<pre wrap="">per-watt?  per dollar?  per transaction?

the integrated gpu is, of course, merely a smaller number of cores

than the separate card, so it will perform the same relative to a

proportional slice of the appropriate-generation add-in card.

trinity a10-5700 has 384 radeon 69xx cores running at 760 MHz,

delivering 584 SP gflops - 65W iirc.  but only 30 GB/s for it and

the CPU.

let's compare that to a 6930 card: 1280 cores, 154 GB/s, 1920 Gflops.

the APU has about 1/3 the cores and flops, and something less than 1/5

the bandwidth.

no doubt the lower bandwidth will hurt some codes, and the lower

host-gpu latency will help others.  I don't know whether APUs have the

same SP/DP ratio as comparable add-in cards.

</pre>
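
<pre wrap="">
A quick check of the ratios quoted above, as a small Python sketch. The

inputs are the figures from the message; the peak-flops formula assumes one

fused multiply-add (2 flops) per core per cycle, which is an assumption

added here, not something the message states.

```python
# Figures as quoted in the message above.
trinity = {"cores": 384, "sp_gflops": 584, "mem_gb_per_s": 30}
card_6930 = {"cores": 1280, "sp_gflops": 1920, "mem_gb_per_s": 154}

# Ratio of the APU to the add-in card for each quoted figure.
ratios = {key: trinity[key] / card_6930[key] for key in trinity}
for key, value in ratios.items():
    print(f"{key}: {value:.3f}")

# Peak SP gflops = cores * 2 flops per FMA * clock in GHz (assumed model).
trinity_peak_gflops = trinity["cores"] * 2 * 0.760
print(f"trinity peak from cores and clock: {trinity_peak_gflops:.1f} gflops")
```
</pre>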

<blockquote type="cite">

<pre wrap="">I assume you will not build 10 nodes with 10 CPUs with integrated

GPUs in order to rival a single card.

</pre>

</blockquote>

<pre wrap="">no, as I said, the premise of my suggestion of in-package ram is

that it would permit glueless tiling of these chips.  the number

you could tile in a 1U chassis would primarily be a question of

power dissipation.

32x 40W units would be easy.  perhaps 20 60W units.  since I'm just

making up numbers here, I'm going to claim that performance will be

twice that of trinity (a nice round 1 Tflop apiece, or 20 Tflops/RU).

I speculate that 4x 4Gb in-package gddr5 would deliver 64 GB/s and

2 GB/socket - a total capacity of 40 GB/RU at 1280 GB/s.

compare this to a 1U server hosting 2-3 K10 cards at 4.6 Tflops and

320 GB/s each: 13.8 Tflops, 960 GB/s.  similar power dissipation.

</pre>
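
<pre wrap="">
The per-rack-unit totals above can be reproduced with a few lines of

Python. This is only an arithmetic sketch of the quoted (and admittedly

made-up) numbers; it treats the K10 figure of 4.6 as Tflops, which is an

assumption based on that card's published single-precision peak, and it

takes the upper end of the "2-3 cards" range.

```python
# Figures as quoted in the message; the tiled design is speculative there too.
UNITS_PER_RU = 20            # 60 W units per rack unit
TFLOPS_PER_UNIT = 1.0        # "a nice round 1 Tflop apiece"
GB_PER_S_PER_UNIT = 64       # in-package gddr5 bandwidth per socket
DRAM_CHIPS, GBIT_PER_CHIP = 4, 4

gb_per_unit = DRAM_CHIPS * GBIT_PER_CHIP / 8   # gigabits to gigabytes
tiled_tflops = UNITS_PER_RU * TFLOPS_PER_UNIT
tiled_gb_per_s = UNITS_PER_RU * GB_PER_S_PER_UNIT
tiled_gb = UNITS_PER_RU * gb_per_unit

# K10 server: 3 cards; 4.6 is read as Tflops here (an assumption).
K10_CARDS, K10_TFLOPS, K10_GB_PER_S = 3, 4.6, 320
k10_tflops = K10_CARDS * K10_TFLOPS
k10_gb_per_s = K10_CARDS * K10_GB_PER_S

print(f"tiled: {tiled_tflops:.1f} Tflops, {tiled_gb_per_s} GB/s, {tiled_gb:.0f} GB per RU")
print(f"K10:   {k10_tflops:.1f} Tflops, {k10_gb_per_s} GB/s per 1U server")
```
</pre>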

</blockquote>

<pre wrap="">_______________________________________________

Beowulf mailing list, <a class="moz-txt-link-abbreviated" href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a> sponsored by Penguin Computing

To change your subscription (digest mode or unsubscribe) visit

<a class="moz-txt-link-freetext" href="http://www.beowulf.org/mailman/listinfo/beowulf">http://www.beowulf.org/mailman/listinfo/beowulf</a></pre>

</blockquote>

<pre wrap="">

------------------------------

_______________________________________________

Beowulf mailing list

<a class="moz-txt-link-abbreviated" href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a>

<a class="moz-txt-link-freetext" href="http://www.beowulf.org/mailman/listinfo/beowulf">http://www.beowulf.org/mailman/listinfo/beowulf</a>

End of Beowulf Digest, Vol 109, Issue 22

****************************************

</pre>

</blockquote>

<br>

</div>

</div>

</span>

</body>

</html>