[Beowulf] A sea of wimpy cores
stewart at serissa.com
Fri Sep 17 07:53:11 PDT 2010
On Sep 17, 2010, at 9:11 AM, Bill Rankin wrote:
> On Sep 17, 2010, at 7:39 AM, Hearns, John wrote:
> Interesting article (more of a letter really) - to be honest, when I first scanned it I was not sure what Holzle's actual argument was. To me, he omitted a lot of the details that make this discussion much less black-and-white (and much more interesting) than he would contend:
> 1) He cites Amdahl's law, but leaves out Gustafson's.
> 2) Yeah, parallel programming is hard. Deal with it. It helps me pay my bills.
> 3) While I am not a die-hard FLOPS/Watt evangelist, he seems to completely ignore power and cooling costs when discussing infrastructure costs.
> The whole letter just seems like it's full of non-sequiturs and just a general gripe about processor architectures.
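For reference, the two scaling laws in point 1 can be compared with a small sketch. This is only an illustration and not something taken from the letter: the 5% serial fraction and the core counts are assumed values, and the function names are made up here. Note that the two laws define the serial fraction differently (Amdahl: share of the single-core runtime; Gustafson: share of the runtime on the parallel machine), so the numbers are not directly interchangeable.

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Fixed problem size: the serial part caps the speedup at 1/serial_fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


def gustafson_speedup(serial_fraction, n_cores):
    """Problem size grows with the machine: scaled speedup is nearly linear in n_cores."""
    return n_cores - serial_fraction * (n_cores - 1)


if __name__ == "__main__":
    s = 0.05  # assumed serial fraction, for illustration only
    for n in (16, 256, 1024):
        print(f"{n:5d} cores: Amdahl {amdahl_speedup(s, n):7.2f}x, "
              f"Gustafson {gustafson_speedup(s, n):7.2f}x")
```

Under Amdahl's fixed-size assumption the speedup never exceeds 1/0.05 = 20x no matter how many slow cores are added; under Gustafson's scaled-size assumption it keeps growing with the core count.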
This letter of Holzle's is consistent with our experience at SiCortex. The cores we had were far more power efficient than the x86's, but they were slower. Because the interconnect was so fast, you could generally scale up farther than with commodity clusters, so you got better absolute performance and better price-performance. But it was tiring to argue these points over and over again, especially to customers who weren't paying for their power or infrastructure and didn't really value the low power aspect.
Holzle's letter doesn't go into enough detail, however. One of the other ideas at SiCortex was that a slow core wouldn't hurt the performance of codes that were actually limited by the memory system. We noticed many codes running at 1-5% of peak performance, spending the rest of their time burning a lot of power waiting for memory. I think this argument has yet to be tested, because the first generation SC machines didn't actually have a very good memory system: the cores were limited to a single outstanding miss. I think there is a fairly good case to be made that systems with slower, low power cores can get higher average efficiencies (% of peak) than fast cores -- provided that the memory systems are close to equivalent. Everyone is using the same DRAMs.
Of course this argument doesn't work well if the application is compute bound, or fits in the cache.
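A rough way to see the argument is a roofline-style bound, where delivered performance is the smaller of the core's peak and what the memory system can feed it. The sketch below is only an illustration under stated assumptions: the peak rates, the per-core memory bandwidth, and the arithmetic intensities are invented numbers, not SiCortex or x86 measurements, and the model ignores latency and outstanding-miss limits.

```python
def attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte):
    """Roofline bound: limited by compute peak or by memory traffic, whichever is lower."""
    return min(peak_gflops, mem_bw_gbs * flops_per_byte)


def efficiency(peak_gflops, mem_bw_gbs, flops_per_byte):
    """Fraction of peak actually delivered under the roofline bound."""
    return attainable_gflops(peak_gflops, mem_bw_gbs, flops_per_byte) / peak_gflops


# All numbers below are hypothetical, chosen only to show the shape of the argument.
FAST_PEAK = 10.0  # GFLOP/s, fast power-hungry core
SLOW_PEAK = 1.0   # GFLOP/s, slow low-power core
MEM_BW = 4.0      # GB/s per core, assumed equal since both use the same DRAMs

if __name__ == "__main__":
    for label, intensity in (("memory-bound", 0.05), ("compute-bound", 8.0)):
        for name, peak in (("fast", FAST_PEAK), ("slow", SLOW_PEAK)):
            rate = attainable_gflops(peak, MEM_BW, intensity)
            pct = 100.0 * efficiency(peak, MEM_BW, intensity)
            print(f"{label:13s} {name}: {rate:5.2f} GFLOP/s, {pct:5.1f}% of peak")
```

With these assumed numbers, the memory-bound code delivers the same absolute rate on both cores, so the slow core runs at a much higher fraction of its peak. The compute-bound code runs at peak on both, and the fast core simply wins by the ratio of the peaks.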
There are lots of alternative ideas in this space. Hyperthreading switches to a different thread when one blocks on memory; Turbo Boost runs faster when the power envelope permits. I recall a paper or two about slowing down the cores in a parallel application until the app itself started to run more slowly, etc.