[Beowulf] The GPU power envelope ------ Re: Beowulf Digest, Vol 109, Issue 22 ----
hartenst at rhrk.uni-kl.de
Sat Mar 16 01:11:34 PDT 2013
Which accelerators have the better power envelope: GPUs or FPGAs?
On 15.03.2013 20:00, beowulf-request at beowulf.org wrote:
> Today's Topics:
> 1. Re: The GPU power envelope (was difference between
> accelerators) (Lux, Jim (337C))
> Message: 1
> Date: Fri, 15 Mar 2013 03:52:25 +0000
> From: "Lux, Jim (337C)" <james.p.lux at jpl.nasa.gov>
> Subject: Re: [Beowulf] The GPU power envelope (was difference between
> To: Beowulf List <beowulf at beowulf.org>
> Message-ID: <CD67E739.2E0F8%james.p.lux at jpl.nasa.gov>
> Content-Type: text/plain; charset="us-ascii"
> I think what you've got here is basically the idea that things that are
> closer together consume less power and cost less, because you don't pay
> the "interface cost".
> An FPU sitting on the bus with the integer ALU inside the chip has minimum
> overhead: no going on and off chip with the associated level shifting, no
> charging and discharging of transmission lines, etc.
> A coprocessor sitting on the bus with the CPU is a bit worse. The
> connection has to go off chip, so you have to change voltage levels and
> physically charge and discharge a longer trace/transmission line.
> A graphics card on a PCI bus has not just one on/off chip transition but
> several, because the PCI interface adds its own chip crossings. More
> capacitors to charge and discharge too.
> A second node connected with some wideband interconnect, but in a
> different box...
> You get the idea.
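To put rough numbers on the "charging and discharging" point above: one full charge/discharge cycle of a capacitive line through a supply costs about C*V^2, so a longer, higher-voltage connection costs more energy per transition. A minimal sketch follows. The capacitance and voltage values are made-up placeholders chosen only to show the trend, not measurements of any real part, and the function names are my own.

```python
def switching_energy_joules(capacitance_farads, swing_volts):
    """Energy drawn from the supply for one full charge/discharge cycle: C * V^2."""
    return capacitance_farads * swing_volts ** 2


def switching_power_watts(capacitance_farads, swing_volts, toggle_rate_hz):
    """Average power if the line completes toggle_rate_hz full cycles per second."""
    return switching_energy_joules(capacitance_farads, swing_volts) * toggle_rate_hz


# Hypothetical (capacitance in farads, voltage swing in volts) per connection type.
# These are illustrative assumptions, not datasheet values.
LINKS = {
    "on-chip wire": (0.2e-12, 1.0),
    "off-chip trace": (5e-12, 1.5),
    "bus slot plus card trace": (20e-12, 3.3),
}

for name, (cap, volts) in LINKS.items():
    picojoules = switching_energy_joules(cap, volts) * 1e12
    print(f"{name:26s} {picojoules:8.2f} pJ per cycle")
```

Whatever the real values are, the energy grows linearly with capacitance and quadratically with voltage swing, which is the "interface cost" in Jim's list.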
> This is why people are VERY interested in on-chip optical transmitters and
> receivers (e.g. things like VCSELs and APDs). You could envision a
> processor with an array of transmitters and receivers to create
> point-to-point links to other processors that are within the field of
> view. Only one "change of media".
> On 3/14/13 4:29 AM, "Vincent Diepeveen" <diep at xs4all.nl> wrote:
>> On Mar 12, 2013, at 5:45 AM, Mark Hahn wrote:
>>>>> I think HSA is potentially interesting for HPC, too.
>>>>> I really expect
>>>>> AMD and/or Intel to ship products this year that have a C/GPU chip
>>>>> mounted on
>>>>> the same interposer as some high-bandwidth ram.
>>>> How can an integrated gpu outperform a gpgpu card?
>>> if you want dedicated gpu computation, a gpu card is ideal.
>>> obviously, integrated GPUs reduce the PCIe latency overhead,
>>> and/or have an advantage in directly accessing host memory.
>>> I'm merely pointing out that the market has already transitioned to
>>> integrated gpus - the vote on this is closed.
>>> the real question is what direction the onboard gpu takes:
>>> how integrated it becomes with the cpu, and how it will take
>>> advantage of upcoming 2.5d-stacked in-package dram.
>> Integrated gpus will of course always have a very limited power budget.
>> So a gpgpu card with the same generation gpu from the same
>> manufacturer, with its bigger power envelope,
>> is always going to be 10x faster of course.
>> If you'd get 10 computers with 10 apus, even for a small price, you
>> still would need an expensive network and switch to
>> handle those, so that's 10 ports. At roughly $1000 a port
>> that's $10k extra, and we assume then that your
>> massive supercomputer doesn't get into trouble further up in
>> bandwidth; otherwise your network cost suddenly becomes $3000 a port
>> instead of $2k a port, with a factor of 10 more ports.
>> That's always going to lose out moneywise to a single gpgpu card
>> that's 10x faster.
>> Whether that's Xeon Phi version X or Nvidia Kx0X, it's always going to
>> be 10x faster of course and 10x cheaper for massive supercomputing.
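The network arithmetic in Vincent's argument can be written out as below. This is only a sketch of the quoted numbers (10 ports, the per-port prices he mentions); the function name is mine, and node and card prices are left out because the post does not give them.

```python
def network_cost(ports, dollars_per_port):
    """Total interconnect cost for a cluster that needs one switch port per node."""
    return ports * dollars_per_port


# Figures taken from the quoted post: 10 APU nodes, roughly $1000 per port.
base = network_cost(10, 1000)

# The post also mentions per-port prices of $2k and $3k once bandwidth
# becomes a problem further up the network.
cheap_tier = network_cost(10, 2000)
costly_tier = network_cost(10, 3000)

print(f"10 ports at $1000: ${base}")
print(f"10 ports at $2000: ${cheap_tier}")
print(f"10 ports at $3000: ${costly_tier}")
```

A single add-in card needs none of these ports, which is the whole of the cost argument being made here.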
>>>> Something like what is it 25 watt versus 250 watt, what will be
>>> per-watt? per dollar? per transaction?
>>> the integrated gpu is, of course, merely a smaller number of the
>>> same cores as the
>>> separate card, so will perform the same, relative to a proportional
>>> slice of the appropriate-generation add-in-card.
>>> trinity a10-5700 has 384 radeon 69xx cores running at 760 MHz,
>>> delivering 584 SP gflops - 65W iirc. but only 30 GB/s shared
>>> between it and the CPU.
>>> let's compare that to a 6930 card: 1280 cores, 154 GB/s, 1920 Gflops.
>>> about 1/3 the cores, flops, and something less than 1/5 the bandwidth.
>>> no doubt the lower bandwidth will hurt some codes, and the lower
>>> latency will help others. I don't know whether APUs have the same
>>> SP/DP ratio as comparable add-in cards.
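Mark's ratios are easy to check. The sketch below recomputes them from the figures quoted above. The one assumption I am adding is 2 single-precision flops per core per cycle (one multiply-add), which is what reproduces the quoted 584 Gflops from 384 cores at 760 MHz; the function name is mine.

```python
def sp_gflops(cores, clock_mhz, flops_per_cycle=2):
    """Peak single-precision Gflops, assuming flops_per_cycle per core per clock."""
    return cores * clock_mhz * flops_per_cycle / 1000.0


# Figures as quoted in the post.
apu = {"cores": 384, "gflops": sp_gflops(384, 760), "gb_per_s": 30}
card = {"cores": 1280, "gflops": 1920, "gb_per_s": 154}

for key in ("cores", "gflops", "gb_per_s"):
    ratio = apu[key] / card[key]
    print(f"{key:9s} APU/card = {ratio:.3f}")

# Bytes of memory bandwidth available per peak flop, for each part.
for name, part in (("APU", apu), ("card", card)):
    print(f"{name:4s} bytes/flop = {part['gb_per_s'] / part['gflops']:.3f}")
```

Cores and flops come out at about 0.30 and bandwidth at a little under 0.20, matching "about 1/3" and "something less than 1/5". The bytes-per-flop line shows the APU is the more bandwidth-starved of the two.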
>>>> I assume you will not build 10 nodes with 10 cpu's with integrated
>>>> gpu in order to rival a
>>>> single card.
>>> no, as I said, the premise of my suggestion of in-package ram is
>>> that it would permit glueless tiling of these chips. the number
>>> you could tile in a 1U chassis would primarily be a question of
>>> power dissipation.
>>> 32x 40W units would be easy. perhaps 20 60W units. since I'm just
>>> making up numbers here, I'm going to claim that performance will be
>>> twice that of trinity (a nice round 1 Tflop apiece, or 20 Tflops/RU).
>>> I speculate that 4x 4Gb in-package gddr5 would deliver 64 GB/s, 2GB/
>>> socket - a total capacity of 40 GB/RU at 1280 GB/s.
>>> compare this to a 1U server hosting 2-3 K10 cards = 4.6 Tflops and
>>> 320 GB/s each. 13.8 Tflops, 960 GB/s. similar power dissipation.
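The per-rack-unit totals in Mark's comparison can be reproduced as follows. This is a sketch using his self-described made-up numbers for the tiled chips, with the K10 figures taken as Tflops, since that card's single-precision peak is in the teraflop range. The helper name and dictionary keys are mine.

```python
def rack_unit_totals(units, tflops_each, gb_per_s_each, gbytes_each=0.0, watts_each=0.0):
    """Sum per-unit peak figures over the number of units in one rack unit."""
    return {
        "tflops": units * tflops_each,
        "gb_per_s": units * gb_per_s_each,
        "gbytes": units * gbytes_each,
        "watts": units * watts_each,
    }


# In-package memory per chip: 4 devices of 4 Gbit each, 8 bits per byte.
gbytes_per_socket = 4 * 4 / 8

# Hypothetical tiled APUs: 20 units of 60 W, 1 Tflop and 64 GB/s apiece.
tiled = rack_unit_totals(20, 1.0, 64, gbytes_each=gbytes_per_socket, watts_each=60)

# 1U server with three K10 cards at 4.6 Tflops and 320 GB/s each.
# Memory capacity and power are not given per card in the post, so left at 0.
k10 = rack_unit_totals(3, 4.6, 320)

print("tiled APUs:", tiled)
print("3x K10    :", k10)
```

On these numbers the tiled chassis comes out ahead on both peak flops (20 vs 13.8 Tflops) and aggregate bandwidth (1280 vs 960 GB/s), which is the point of the comparison.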