[Beowulf] The GPU power envelope ------ Re: Beowulf Digest, Vol 109, Issue 22 ----

Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Sun Mar 17 21:13:48 PDT 2013

ASICs will beat FPGAs every day because the feature size is smaller, you don't have to waste real estate (and leakage current) on gates that aren't used, etc.  Even if you're talking one time programmable FPGAs (like Actel AX series), you are still applying power to the gates not in the current configuration.  Even the biggest FPGAs are smaller (in terms of transistor count) than modern processors, general purpose or GPU.

GPUs are like general purpose CPUs in the sense that most of the logic is fixed, and the "program" controls what they do, so they are more like ASICs (actually, in one sense they ARE ASICs).

The difference is in the cost of a design spin.  I can reprogram the configuration of a Xilinx Virtex 5 in a few milliseconds, I can probe into the internal state using the JTAG port, and if the logic is wrong, I can change a line of Verilog or VHDL, resynthesize in a few minutes, and try it again.  Spinning a new rev of a GPU or ASIC is a multimillion dollar exercise.

From: Reiner Hartenstein <hartenst at rhrk.uni-kl.de>
Reply-To: reiner at hartenstein.de
Date: Saturday, March 16, 2013 12:11 AM
To: beowulf at beowulf.org
Subject: Re: [Beowulf] The GPU power envelope ------ Re: Beowulf Digest, Vol 109, Issue 22 ----

which accelerators have the better power envelope?   GPUs or FPGAs?

Best regards,

On 15.03.2013 20:00, beowulf-request at beowulf.org wrote:


Today's Topics:

   1. Re: The GPU power envelope (was difference between
      accelerators) (Lux, Jim (337C))


Message: 1
Date: Fri, 15 Mar 2013 03:52:25 +0000
From: "Lux, Jim (337C)" <james.p.lux at jpl.nasa.gov>
Subject: Re: [Beowulf] The GPU power envelope (was difference between
To: Beowulf List <beowulf at beowulf.org>
Message-ID: <CD67E739.2E0F8%james.p.lux at jpl.nasa.gov>
Content-Type: text/plain; charset="us-ascii"

I think what you've got here is basically the idea that things that are
closer consume less power and cost less, because you don't have the
"interface cost".

An FPU sitting on the bus with the integer ALU inside the chip has minimum
overhead: no going on and off chip and the associated level shifting, no
charging and discharging of the transmission lines, etc.

A coprocessor sitting on the bus with the CPU is a bit worse.  The
connection has to go off chip, so you have to change voltage levels, and
physically charge and discharge a longer trace/transmission line.

A graphics card on a PCI bus has not only the on/off chip transition, it
has more than one because the PCI interface also goes through that. More
capacitors to charge and discharge too.

A second node connected with some wideband interconnect, but in a
different box...

You get the idea..
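
To put rough numbers on the "interface cost", here is a minimal sketch using the standard CMOS dynamic power relation P = a * C * V^2 * f. The capacitance and voltage values are illustrative placeholders picked only to show the trend (they are not measurements of any real part), and the function and dictionary names are made up for this example.

```python
def dynamic_power_watts(capacitance_f, voltage_v, toggle_hz, activity=1.0):
    """CMOS dynamic power: P = activity * C * V^2 * f."""
    return activity * capacitance_f * voltage_v ** 2 * toggle_hz


# Hypothetical (capacitance in farads, signalling voltage in volts) per hop.
# Placeholder values for illustration only, not data for any real device.
LINKS = {
    "on-chip wire": (0.1e-12, 1.0),
    "off-chip trace": (10e-12, 3.3),
}

TOGGLE_HZ = 100e6  # assumed toggle rate, identical for both cases

powers = {
    name: dynamic_power_watts(c, v, TOGGLE_HZ)
    for name, (c, v) in LINKS.items()
}

for name, watts in powers.items():
    print(f"{name}: {watts * 1e3:.4f} mW per signal line")
```

Because power scales with C and with V squared, both the longer trace and the level shift to a higher I/O voltage make each off-chip hop more expensive, which is the point of the list above.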

This is why people are VERY interested in on-chip optical transmitters and
receivers (e.g. things like VCSELs and APDs).  You could envision a
processor with an array of transmitters and receivers to create point to
point links to other processors that are within the field of view.  Only
one "change of media".

On 3/14/13 4:29 AM, "Vincent Diepeveen" <diep at xs4all.nl> wrote:

On Mar 12, 2013, at 5:45 AM, Mark Hahn wrote:

I think HSA is potentially interesting for HPC, too.  I really expect
AMD and/or Intel to ship products this year that have a C/GPU chip
mounted on the same interposer as some high-bandwidth ram.

How can an integrated gpu outperform a gpgpu card?

if you want dedicated gpu computation, a gpu card is ideal.
obviously, integrated GPUs reduce the PCIe latency overhead,
and/or have an advantage in directly accessing host memory.

I'm merely pointing out that the market has already transitioned to
putting integrated gpus - the vote on this is closed.
the real question is what direction the onboard gpu takes:
how integrated it becomes with the cpu, and how it will take
advantage of upcoming 2.5d-stacked in-package dram.

Integrated gpu's will of course always have a very limited power budget.

So gpgpu cards with the same-generation gpu from the same
manufacturer, with a bigger power envelope,
are always going to be 10x faster of course.

If you'd get 10 computers with 10 apu's, even for a small price, you
still would need an expensive network and switch to
handle those, so that's 10 ports. So that's 1000 dollar a port
roughly, so that's $10k extra, and we assume then that your
massive supercomputer doesn't get into trouble further up in
bandwidth otherwise your network cost suddenly gets $3000 a port
instead of $2k a port, with factor 10 ports more.

That's always going to lose it moneywise from a single gpgpu card
that's 10x faster.

Whether that's Xeon Phi version X Nvidia Kx0X, it's always going to
be 10x faster of course and 10x cheaper for massive supercomputing.

Something like what is it 25 watt versus 250 watt, what will be

per-watt?  per dollar?  per transaction?

the integrated gpu is, of course, merely a smaller number of cores
than the separate card, so will perform the same, relative to a
proportional slice of the appropriate-generation add-in-card.

trinity a10-5700 has 384 radeon 69xx cores running at 760 MHz,
delivering 584 SP gflops - 65W iirc.  but only 30 GB/s for it and
the CPU.

let's compare that to a 6930 card: 1280 cores, 154 GB/s, 1920 Gflops.
about 1/3 the cores, flops, and something less than 1/5 the bandwidth.
no doubt the lower bandwidth will hurt some codes, and the lower
latency will help others.  I don't know whether APUs have the same
SP/DP ratio as comparable add-in cards.
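
The ratios quoted above can be checked with a few lines of Python. This is only a sketch: the figures are the ones given in this message, the variable names are invented for the example, and the bytes-per-flop line is an extra derived number rather than anything claimed in the thread.

```python
# Figures as quoted in this message (peak SP Gflops, memory bandwidth in GB/s).
apu = {"cores": 384, "gflops": 584, "gb_per_s": 30}      # trinity a10-5700
card = {"cores": 1280, "gflops": 1920, "gb_per_s": 154}  # 6930 add-in card

# APU as a fraction of the add-in card, per metric.
ratios = {key: apu[key] / card[key] for key in apu}

# Memory bandwidth available per peak flop (bytes per flop).
apu_bytes_per_flop = apu["gb_per_s"] / apu["gflops"]
card_bytes_per_flop = card["gb_per_s"] / card["gflops"]

for key, value in ratios.items():
    print(f"{key}: APU is {value:.2f}x the card")
print(f"bytes/flop: APU {apu_bytes_per_flop:.3f}, card {card_bytes_per_flop:.3f}")
```

The APU comes out at roughly 0.3 of the card's cores and flops but just under 0.2 of its bandwidth, so it has fewer bytes per flop to work with, and it shares those with the CPU.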

I assume you will not build 10 nodes with 10 cpu's with integrated
gpu in order to rival a
single card.

no, as I said, the premise of my suggestion of in-package ram is
that it would permit glueless tiling of these chips.  the number
you could tile in a 1U chassis would primarily be a question of
power dissipation.
32x 40W units would be easy.  perhaps 20 60W units.  since I'm just
making up numbers here, I'm going to claim that performance will be
twice that of trinity (a nice round 1 Tflop apiece, or 20 Tflops/RU).
I speculate that 4x 4Gb in-package gddr5 would deliver 64 GB/s, 2 GB/socket -
a total capacity of 40 GB/RU at 1280 GB/s.

compare this to a 1U server hosting 2-3 K10 cards = 4.6 Tflops and
320 GB/s each.  13.8 Tflops, 960 GB/s for three cards.  similar power
dissipation.
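
The per-rack-unit totals above are simple multiplication; a small sketch makes the comparison explicit. All inputs are the speculative numbers from this message, the K10 figure is read as 4.6 Tflops per card (an assumption about the intended unit), and the variable names are invented for the example.

```python
# Hypothetical tiled-APU 1U box, using the made-up numbers from the message.
units = 20                      # 60W units per rack unit
tflops_per_unit = 1.0           # "twice that of trinity", rounded
gb_per_s_per_unit = 64          # speculated in-package gddr5 bandwidth
gb_per_unit = 4 * 4 / 8         # four 4 Gb devices = 2 GB per socket

tiled = {
    "tflops": units * tflops_per_unit,
    "gb_per_s": units * gb_per_s_per_unit,
    "capacity_gb": units * gb_per_unit,
}

# 1U server with three K10 cards; 4.6 is taken to mean Tflops per card.
k10_cards = 3
k10 = {
    "tflops": k10_cards * 4.6,
    "gb_per_s": k10_cards * 320,
}

print("tiled APUs:", tiled)
print("3x K10:    ", k10)
```

On these assumed figures the tiled box is ahead on both peak flops and aggregate bandwidth, which is the argument being made; everything depends on the invented per-unit numbers.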

Beowulf mailing list, Beowulf at beowulf.org, sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf

End of Beowulf Digest, Vol 109, Issue 22
