[Beowulf] Xeon Phi sucks in practical double precision?
diep at xs4all.nl
Fri Mar 22 07:35:17 PDT 2013
On Mar 18, 2013, at 7:17 PM, Craig Tierney - NOAA Affiliate wrote:
> On Thu, Mar 14, 2013 at 5:42 AM, Vincent Diepeveen <diep at xs4all.nl> wrote:
>> On Mar 12, 2013, at 5:45 AM, Mark Hahn wrote:
>>> trinity a10-5700 has 384 radeon 69xx cores running at 760 MHz,
>>> delivering 584 SP gflops - 65W iirc. but only 30 GB/s for it and
>>> the CPU.
>>> let's compare that to a 6930 card: 1280 cores, 154 GB/s, 1920
>>> about 1/3 the cores, flops, and something less than 1/5 the
>>> no doubt the lower bandwidth will hurt some codes, and the lower
>>> latency will help others. I don't know whether APUs have the same
>>> SP/DP ratio as comparable add-in cards.
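
The 584 SP gflops figure quoted above for Trinity can be reproduced with a short calculation. This is only a sketch: it assumes that peak throughput is counted as one fused multiply-add (2 flops) per core per clock, which the post does not state, and the function name is made up for illustration.

```python
def peak_sp_gflops(cores, clock_mhz, flops_per_cycle=2):
    """Theoretical peak single precision Gflops.

    flops_per_cycle=2 is an assumption (one fused multiply-add per
    core per clock); it is not stated in the thread.
    """
    return cores * flops_per_cycle * clock_mhz / 1000.0


# Figures quoted in the post for the A10-5700: 384 cores at 760 MHz, 30 GB/s.
trinity_gflops = peak_sp_gflops(384, 760)
bytes_per_flop = 30.0 / trinity_gflops

print(f"peak: {trinity_gflops:.2f} SP Gflops")
print(f"memory bandwidth per flop: {bytes_per_flop:.3f} bytes")
```

The bytes-per-flop number is the reason the 30 GB/s shared with the CPU matters: it tells you how little memory traffic each flop can afford at peak.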
>> Since when in HPC do we want SP gflops?
> I thought we learned on this list to not generalize as it can create
> flame-wars? The people in HPC who care about SP gflops are those who
> understand the mathematics in their algorithms and don't want to waste
> very precious memory bandwidth by unnecessarily promoting their
> floating point numbers to double unless there is a good (and
> measurable) reason to do so. We use double precision, but only when
> it is required to maintain numerical stability. The result is lower
> memory requirements to store model state and codes that run faster.
Oh come on, grow up.
For the past decades in HPC I've only seen scientists worry about
double precision. Now Intel releases a new chip and suddenly only
single precision without ECC matters.
So the conclusion of this discussion is: with Xeon Phi we must be very
careful drawing conclusions about its double precision capabilities,
as we only hear something about SINGLE precision: 32-bit floating
point, 23-bit mantissa.
Anyone who's interested in Xeon Phi I'd warn: "be careful buying it
if you intend to use it for double precision".
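
What a 23-bit mantissa means in practice can be seen with a small experiment. The sketch below is not from the thread. It emulates single precision in plain Python by rounding through the struct module after every addition, then compares a long running sum against the same sum in double. The helper names are made up for illustration.

```python
import struct


def to_f32(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("f", struct.pack("f", x))[0]


def repeated_sum(step, n, single):
    """Add step to a running total n times.

    With single=True the step and every intermediate total are rounded
    to single precision, which emulates a float32 accumulator.
    """
    if single:
        step = to_f32(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_f32(total)
    return total


n = 1_000_000
exact = 100_000.0  # the mathematically exact value of n * 0.1
err32 = abs(repeated_sum(0.1, n, single=True) - exact)
err64 = abs(repeated_sum(0.1, n, single=False) - exact)

print(f"single precision error: {err32:.6g}")
print(f"double precision error: {err64:.6g}")
```

Whether an error of that size matters depends on the algorithm, which is the point both sides of this thread are arguing about.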
>> Double precision rules. Let's talk about double precision and ECC.
>> How many APU's have ECC?
>> As for power calculations. A single box here eats 170 watt under full
>> load (all cpu's under full load),
>> feeding a Tesla which in practice eats a tad over 300 watt. So the
>> total is under 500 watt.
>> Now you go run 100 apu's to get to the same double precision
>> crunching power. Each box eating 150 watt.
>> That's 15 kilowatt for the APU's.
>> Even if you would be using 10, which in single precision delivers the
>> same, you still eat 1500 watt (1.5 kilowatt).
>> For that you have to maintain 10 computers, versus just 1 Tesla.
>> Now even if an APU would be able to deliver a lot of double
>> precision and have ECC,
>> even then you have got 10 infiniband cables from those 10 APU's, and
>> you need another 250 watt for
>> that and another bunch of switches of 300 watt that the Tesla
>> solution doesn't need.
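
The power arithmetic in the quoted paragraph above can be checked in a few lines. This is a sketch using only the wattages given in the post; the function name is made up for illustration, and "300 watt" stands in for the Tesla's "a tad over 300 watt".

```python
def total_watts(boxes, watts_per_box, extra_watts=0):
    """Total draw of a number of identical boxes plus any extra equipment."""
    return boxes * watts_per_box + extra_watts


# One host at 170 W feeding one Tesla at roughly 300 W.
tesla_setup = total_watts(1, 170, extra_watts=300)

# 100 APU boxes at 150 W each (the double precision comparison).
apu_dp = total_watts(100, 150)

# 10 APU boxes at 150 W each (the single precision comparison),
# without and with the 250 W InfiniBand and 300 W switch overhead.
apu_sp = total_watts(10, 150)
apu_sp_networked = total_watts(10, 150, extra_watts=250 + 300)

for label, watts in [
    ("1 box + Tesla", tesla_setup),
    ("100 APU boxes", apu_dp),
    ("10 APU boxes", apu_sp),
    ("10 APU boxes + network", apu_sp_networked),
]:
    print(f"{label}: {watts} W = {watts / 1000:.2f} kW")
```

Note that 10 boxes at 150 W come to 1500 W, which is 1.5 kW.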
>>>> I assume you will not build 10 nodes with 10 cpu's with integrated
>>>> gpu in order to rival a
>>>> single card.
>>> no, as I said, the premise of my suggestion of in-package ram is
>>> that it would permit glueless tiling of these chips. the number
>>> you could tile in a 1U chassis would primarily be a question of
>>> power dissipation.
>>> 32x 40W units would be easy. perhaps 20 60W units. since I'm just
>>> making up numbers here, I'm going to claim that performance will be
>>> twice that of trinity (a nice round 1 Tflop apiece), or 20 Tflops/RU.
>>> I speculate that 4x 4Gb in-package gddr5 would deliver 64 GB/s, 2GB/
>>> socket - a total capacity of 40 GB/RU at 1280 GB/s.
>>> compare this to a 1U server hosting 2-3 K10 cards = 4.6 Tflops and
>>> 320 GB/s each. 13.8 Tflops, 960 GB/s. similar power dissipation.
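
The per-rack-unit numbers in the quoted paragraph above follow from simple multiplication. The sketch below reruns them with the made-up figures from the post (20 sockets of 60 W, 1 Tflop and 64 GB/s each) and reads the K10 figure as Tflops per card, which is an assumption based on the card's published single precision peak. The variable names are illustrative only.

```python
# Hypothetical tiled-APU chassis, using the speculative numbers from the post.
sockets = 20
gb_per_socket = 4 * 4 / 8          # 4 chips x 4 Gbit each = 16 Gbit = 2 GB
apu_tflops = sockets * 1.0         # 1 Tflop apiece
apu_bandwidth = sockets * 64       # 64 GB/s per socket
apu_capacity = sockets * gb_per_socket

# 1U server with 3 K10 cards (assumed 4.6 Tflops SP and 320 GB/s per card).
cards = 3
k10_tflops = round(cards * 4.6, 1)
k10_bandwidth = cards * 320

print(f"tiled APUs: {apu_tflops:.1f} Tflops, {apu_bandwidth} GB/s, {apu_capacity:.0f} GB per RU")
print(f"3x K10:     {k10_tflops:.1f} Tflops, {k10_bandwidth} GB/s per RU")
```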