[Beowulf] Has anyone actually seen/used a cell system?

Sun Oct 1 18:06:44 PDT 2006

Andrew Shewmaker wrote:
> On 10/1/06, Geoff Jacobs <gdjacobs at gmail.com> wrote:
> 
>> > interesting that for a 2.4GHz Cell, they get at most 10 FP Gflops
>> > per SPE.
>> > does anyone have SGEMM numbers for a 3GHz Intel Core2?  I'll guess that
>> > efficiency of libgoto with 2 threads would be >= 80%, so flops would be
>> > .8*2*8*3 =~ 40 Gflops, or half a Cell chip. makes it hard to argue for
>> > wide use of Cell, I think...
>>
>> Unfortunately, the reality is a little crappier. Sciencemark 2.0 SGEMM
>> sees 11 gflops on an E6700. DGEMM sees 5-6 gflops.
>> http://www.pcper.com/article.php?aid=265&type=expert&pid=3
> 
> The same site reports that the X6800, a 2.93 GHz Core2, sees
> almost 12.5 SP GFLOPS using ScienceMark 2.0 (6.2 DP GFLOPS).
> 
> http://www.pcper.com/article.php?aid=272&type=expert&pid=5
> 
> I don't know much about ScienceMark.  The website has been
> replaced with advertisements.  From what I gathered from several
> review sites, it is MS Windows only and single threaded.  My
> guess is that Goto's implementation would perform significantly
> better even with a single thread.  Unfortunately, I looked all over
> and couldn't find Core2 benchmarks using Goto's BLAS.

WRT ScienceMark sgemm being single-threaded, I think you're right. The
primary web site is German, and explicitly states that BLAS tests are not
SMP. My bad :|
http://www.sciencemark.de/faq.html

ScienceMark has been hand optimized for many processors, so its
performance shouldn't be much worse than MKL or Goto. I'm assuming it's
indicative of per-core performance on Core 2. Is it safe to say that
Core 2 achieves <15 gflops/core at 3 GHz, assuming a ~15% premium with
Goto BLAS?
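
To make that estimate explicit, here is a rough sketch of the arithmetic. Every input is a figure quoted in this thread (the 8 flops/cycle comes from the ".8*2*8*3" estimate, the 12.5 gflops from the X6800 result). The linear scaling with clock and the 15% premium are assumptions, not measurements, and the function name is just for illustration.

```python
# Back-of-envelope check of the per-core estimate.
# All inputs are figures quoted in this thread, not new measurements.

FLOPS_PER_CYCLE_SP = 8      # per core, taken from the ".8*2*8*3" estimate
MEASURED_GFLOPS = 12.5      # ScienceMark 2.0 SGEMM on the X6800, single thread
MEASURED_CLOCK_GHZ = 2.93   # X6800 clock
TARGET_CLOCK_GHZ = 3.0
GOTO_PREMIUM = 0.15         # assumed gain of Goto BLAS over ScienceMark


def estimate_goto_gflops(measured, measured_clock, target_clock, premium):
    """Scale a measured rate linearly with clock, then apply the premium."""
    scaled = measured * target_clock / measured_clock
    return scaled * (1.0 + premium)


peak = FLOPS_PER_CYCLE_SP * TARGET_CLOCK_GHZ
estimate = estimate_goto_gflops(
    MEASURED_GFLOPS, MEASURED_CLOCK_GHZ, TARGET_CLOCK_GHZ, GOTO_PREMIUM
)

print(f"peak per core:      {peak:.1f} gflops")
print(f"estimate with Goto: {estimate:.1f} gflops ({estimate / peak:.0%} of peak)")
```

Under those assumptions the estimate lands a little under 15 gflops/core, which is well short of the 80% efficiency guessed earlier in the thread.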

>> The linked article _is_ an evaluation of performance on an actual Cell
>> chip. Unfortunately, it's a lower clocked pre-production example running
>> an experimental pseudo-compiler. I'm interested in seeing SGEMM using
>> Cell-specific intrinsics. Such a benchmark should represent the maximum
>> practical performance peak.
>>
>> Note: even if the Sequoia numbers are approximately the same as SPE
>> intrinsics, cell is still 7x faster than Core2.
> 
> The Sequoia implementation used IBM's Cell SDK, according to the paper.
> 
> It looks like a preproduction 2.4 GHz Cell is 2-6 times faster than a
> 2.93 GHz Core2 at SGEMM.  That's an awfully big range, so hopefully
> someone will be kind enough to benchmark libgoto on Core2 for us.  The
> history file indicates that libgoto is optimized for Core2, but I don't
> have one to test.

From what I read, it churned out two code paths, which were compiled
using either xlc (spu) or gcc (ppu). This sounds very different from
using intrinsics (spu_mul, spu_add, etc...), which would be applied more
like an SSE optimization.

I guess my biggest objection to Mark's comment was the comparison of
SGEMM implemented in an experimental language with unproven structure
against a theoretical calculation of Core 2 peak performance. I'd simply
like to see a benchmark comparison of SGEMM (and DGEMM) using Core
2-optimized BLAS vs. Cell-optimized BLAS, which would allow a useful
conclusion about how interesting Cell is for HPC.
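
For reference, the speedup ratios quoted in this thread can be reproduced from the numbers already posted. This is only a sketch: the 8 SPEs per chip is an assumption inferred from "40 Gflops, or half a Cell chip", and the Core 2 entries mix single-threaded measurements with a two-thread guess, which is exactly why the range is so wide.

```python
# Reproduce the Cell-vs-Core2 SGEMM ratios from figures quoted in the thread.

CELL_GFLOPS_PER_SPE = 10.0   # "at most 10 FP Gflops per SPE"
SPES_PER_CELL = 8            # assumption, implied by "40 Gflops, or half a Cell chip"

# Core 2 SGEMM figures mentioned in the thread (gflops).
core2_sgemm_gflops = {
    "E6700, ScienceMark 2.0, 1 thread": 11.0,
    "X6800, ScienceMark 2.0, 1 thread": 12.5,
    "3 GHz Core2, libgoto guess, 2 threads": 0.8 * 2 * 8 * 3,
}

cell_gflops = CELL_GFLOPS_PER_SPE * SPES_PER_CELL

# Ratio of whole-chip Cell throughput to each Core 2 figure.
ratios = {name: cell_gflops / gflops for name, gflops in core2_sgemm_gflops.items()}

for name, ratio in ratios.items():
    print(f"{name}: Cell is {ratio:.1f}x faster")
```

Note that the first two rows compare a whole Cell chip against a single Core 2 core, so they overstate the chip-to-chip gap.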

-- 
Geoffrey D. Jacobs

Go to the Chinese Restaurant,
Order the Special