[Beowulf] ARM CPUs and development boards and research

Vincent Diepeveen diep at xs4all.nl
Tue Nov 27 16:32:22 PST 2012


On Nov 28, 2012, at 12:17 AM, Prentice Bisbal wrote:

>
> On 11/27/2012 03:37 PM, Douglas Eadline wrote:
>>
>>> My interest in Arm has been the flip side of balancing flops to
>>> network bandwidth.  A standard dual socket (AMD or Intel) can
>>> trivially saturate GigE.  One option for improving the flops/network
>>> balance is to add network bandwidth with Infiniband.  Another is a
>>> slower, cheaper, cooler CPU and GigE.
>>>
>> applause.
>
> I applaud that applause.
>
> What Bill has just described is known as an "Amdahl-balanced system",
> and is the design philosophy behind the IBM Blue Genes and also
> SiCortex. In my opinion, this is the future of HPC. Use lower-power,
> slower processors, and then try to improve network performance to
> reduce the cost of scaling out. Essentially, you want the processors
> to be *just* fast enough to keep ahead of the networking and memory,
> but no faster, to optimize energy savings.
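
To make that balance concrete, here is a quick back-of-the-envelope
sketch in Python; the per-node flop figures are illustrative
assumptions for 2012-era hardware, not measurements:

    # Flops-per-byte balance of two node types against GigE.
    # Node flop rates below are illustrative assumptions.

    GIGE_BYTES_PER_SEC = 125e6   # 1 Gbit/s Ethernet ~= 125 MB/s

    nodes = {
        "dual-socket x86, ~250 Gflop/s": 250e9,
        "quad-core 1 GHz ARM, ~4 Gflop/s": 4e9,
    }

    for name, flops in nodes.items():
        # Flops the node must do per byte it can push over GigE;
        # the lower the ratio, the better balanced the node is.
        print(f"{name}: {flops / GIGE_BYTES_PER_SEC:,.0f} flops/byte")

That prints roughly 2,000 flops per GigE byte for the fat x86 node
versus 32 for the slow ARM node, which is the balance argument in a
nutshell.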

For HPC the winning concept seems to be increasing the core count of
manycore chips.

We also see how Blue Gene couldn't keep to its original concept - it's
at what, 18 cores per chip now, or so?

So the manycores have won the battle in HPC big time, at least for
codes that can be vectorized.

If we look at ARM, for example, constructing a huge supercomputer with
it is already impractical from a production viewpoint: the combined
silicon area of all the L1 and L2 caches alone makes it too expensive
to produce at a competitive price versus a single huge manycore chip.

Suppose you had quad-core ARMs compete with an Nvidia K20.

The K20 delivers 1.3 Tflop/s; that's 1300 Gflop/s.

The quad-core ARMs run at 1 GHz. Each CPU has up to 1 MB of L2 cache
and a 32 KB + 32 KB L1 cache.

So, at 1 flop per cycle per core, matching that takes 1300 / 4 = 325
such quad-core ARMs - in total 64 KB * 325 = 20.8 MB worth of L1 cache
and 325 MB worth of L2 cache.

Producing that one K20X costs a fraction of that.

Those 64-bit ARMs are going to eat around 6 watts each at full load.

6 watts * 325 CPUs = 1950 watts, and with 325 chips you need cooling
everywhere.
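
A quick sanity check of that arithmetic in Python; the per-chip flop
rate and the 64 KB of L1 counted per chip are the same assumptions as
in the text above:

    # Rough reproduction of the ARM-vs-K20 numbers above, assuming
    # 1 flop/cycle/core and 64 KB of L1 counted per chip.

    K20_FLOPS      = 1.3e12      # ~1.3 Tflop/s
    ARM_CHIP_FLOPS = 4 * 1e9     # quad-core @ 1 GHz, 1 flop/cycle/core

    chips    = K20_FLOPS / ARM_CHIP_FLOPS   # -> 325 chips
    l1_total = chips * 64e3                 # -> 20.8 MB of L1
    l2_total = chips * 1e6                  # -> 325 MB of L2
    watts    = chips * 6                    # -> 1950 W at 6 W per chip

    print(f"{chips:.0f} chips, {l1_total / 1e6:.1f} MB L1, "
          f"{l2_total / 1e6:.0f} MB L2, {watts:.0f} W")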

Now this is an extreme example, but it shows clearly that, for HPC,
small processors can never compete with giant manycores from a
production-cost viewpoint.

I don't know at what price IBM delivers Blue Gene, but I'm sure the
next-generation Blue Gene CPU will have more cores than the current
one.

We can be sure IBM will find a solution there that they can offer to
their clients in a competitive manner - yet it will also mean an ever
bigger chip they have to offer in order to compete against what Nvidia
offers there now.

As for Intel - they haven't made statements on whether Xeon Phi has
cache coherency.

Yet I assume they will need to drop that or make it manual.

As for networking, I also disagree there.

Take a calculation that uses an FFT - be it an approximation that can
accumulate rounding errors, as with a floating-point FFT, or a
lossless calculation using an NTT (number-theoretic transform).

In both cases the network bandwidth these algorithms need is a
fraction of the CPU power they possess.

So if you have a manycore with enough RAM on it, you can to a large
extent do the calculation locally without communicating it. Only after
a bunch of iterations do you need to communicate with other parts of
the network.

So the communication, in that sense, is only O(log n) of what you
actually calculate. The last phases of the algorithm are a tad more
tricky to do in that manner, but it's the same principle.
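
As a rough illustration of why the compute side dominates, here is a
sketch using the textbook 5 n log2(n) flop count for an FFT, and
assuming each point crosses the network once (both are assumptions for
illustration, not measurements of any particular code):

    import math

    # Flops vs network traffic for a distributed FFT of n points.
    # ~5 n log2(n) flops; assume each point crosses the network once,
    # as in a transpose-based distributed FFT.

    n           = 2**30                  # problem size (assumption)
    flops       = 5 * n * math.log2(n)   # ~161e9 flops
    bytes_moved = n * 16                 # complex double = 16 bytes

    print(f"~{flops / bytes_moved:.1f} flops per byte communicated")
    # -> ~9.4, and the ratio grows with log2(n)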

Even then the bandwidth needed is still massive - not because the
algorithm isn't O(log n), but simply because a card that delivers 1+
Tflop/s is a huge amount of compute by today's standards. The biggest
bandwidth you can achieve over one FDR network card is just a fraction
of the bandwidth that 1 Tflop/s represents.

If we speak about FMAs (fused multiply-adds), 1.3 Tflop/s means about
666 billion FMAs a second. Each FMA reads 3 doubles and writes 1.
That's a total bandwidth of 4 doubles * 8 bytes * 0.666e12/s = 32 *
0.666 = 21+ TB/s of internal operand traffic that such a manycore is
handling.

FDR InfiniBand delivers roughly 6.8 GB/s per port - obviously just a
fraction of that - so the reality already is that the network delivers
only a small fraction of what the GPU can handle internally.
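
The same arithmetic in a few lines of Python (the ~6.8 GB/s FDR figure
is the usual per-port data rate; the rest follows from the numbers
above):

    # Internal operand bandwidth of a ~1.3 Tflop/s card vs one FDR port.

    FMAS_PER_SEC  = 0.666e12    # 1.3 Tflop/s at 2 flops per FMA
    BYTES_PER_FMA = 4 * 8       # 3 doubles read + 1 double written
    internal_bw   = FMAS_PER_SEC * BYTES_PER_FMA   # ~21.3 TB/s

    FDR_BW = 6.8e9              # FDR 4x InfiniBand, ~6.8 GB/s per port

    print(f"internal {internal_bw / 1e12:.1f} TB/s vs "
          f"FDR {FDR_BW / 1e9:.1f} GB/s "
          f"-> ratio ~{internal_bw / FDR_BW:.0f}:1")
    # -> roughly 3000:1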

>
> The Blue Genes do this incredibly well, so did SiCortex, and SeaMicro
> appears to be doing this really well, too, based on all the press
> they've been getting. With the DARPA Exascale report saying we can't
> get to Exascale with current power consumption profiles, you can bet
> this will be a hot area of research over the next few years.

They'll find a solution - I'm sure of it - yet it will involve massive
numbers of cores on a single CPU.

Whether it's 3D stacking, the next generation of transistors, or some
new technology - we'll see soon :)

>
> Okay. I'm done listening to myself type.
>
> Prentice



