[Beowulf] Stream numbers for SiCortex's MIPS based SOC ...

Thu Dec 20 12:19:13 PST 2007

richard.walsh at comcast.net wrote:

> All,
>  
> Anyone seen Stream numbers for one and/or more cores from SiCortex, say
> a SiCortex Catapult System?  The chip has two memory controllers, and I
> have heard it provides:
>  
> "more than 10 Terabytes of bandwidth"
>  
> in the largest configuration, but have not seen any measured memory
> bandwidth numbers for this box.  Come to think of it, I have not seen
> measured numbers for its interconnect performance either.  Sustaining a
> reasonable ratio of bytes delivered from memory to flops should be
> easier on this processor with its lower clock, but it does have 2
> cores.  I am interested in how it looks compared to Opteron, etc.  It
> is supposed to be a balanced design, but it seems there are few
> measured results available to validate this.
>  
> As always your thoughts are appreciated ...
>  
> Regards,
>  
> rbw 
> -- 

The usual caveats apply: these are microbenchmarks; delivered
application performance and scalability are what matter.  The metrics
of interest may include absolute performance, cost/performance, and
power/performance.

The SiCortex machines have a substantially different balance of
processing, memory, and communications than desktop machines.  And
don't forget they use about 600 milliwatts per core or 12 watts per
node including 4 GB memory and the interconnect.

Read on...

Regarding the interconnect, we published some results at the 2007
Euro PVM/MPI conference last October.  I've just realized that the
paper is not on our website, but I'll get that fixed.

We've measured short message latency at 1.4 microseconds half-round
trip (ping pong). This isn't as fast as some ping pong results, but
when running at scale, the HPCC Random Ring latency is under 2
microseconds when all 648 cores of an SC648 are active at once.  The
fastest machine with 512 or more cores in the current HPCC results
reports 2.3 microseconds. 

For large messages, the point to point bandwidth off-node is about 1.1
gigabytes/sec.  That aggregate capacity seems to be shared fairly
among all cores reading and writing, so HPCC random ring gets about
600 MB/sec per node on 108 nodes (1 core/node) and about 100 MB/sec
per core when all 648 cores of the SC648 are running at once.  Looking
on the HPCC Results page for machines of that scale, I find that the
NEC SX-8, the Cray XT3s, Columbia (Altix), and the new Intel Endeavor
cluster are faster.
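
As a rough consistency check, the per-core random ring figure can be
scaled back up to a per-node aggregate.  This is only back-of-the-
envelope arithmetic on the numbers quoted above; the variable names
are made up for illustration and 1.1 gigabytes is taken as 1100 MB.

```python
# Back-of-the-envelope check using only figures quoted in this message.
CORES_PER_NODE = 648 // 108      # SC648: 648 cores on 108 nodes

p2p_mb_s = 1100.0                # large-message point-to-point, off-node
ring_one_core_mb_s = 600.0       # random ring, 1 core/node, per node
ring_all_cores_mb_s = 100.0      # random ring, all cores active, per core

# Per-node aggregate when every core is communicating at once.
node_aggregate_mb_s = ring_all_cores_mb_s * CORES_PER_NODE

print(CORES_PER_NODE)                              # 6
print(node_aggregate_mb_s)                         # 600.0
print(round(node_aggregate_mb_s / p2p_mb_s, 2))    # 0.55
```

The six-core aggregate matches the one-core-per-node result, which is
what fair sharing of a fixed per-node capacity would look like.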

Stream Triad gives 360 megabytes/sec when one core is active, and 340
megabytes/sec per core when all six cores are active at once.  We're
pleased that we can run all six cores at once with little
degradation. The core we are currently using supports only a single
outstanding cache miss and does not have a prefetch unit.
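
For reference, Triad is the kernel a[i] = b[i] + q*c[i].  A small
sketch of what "little degradation" means in numbers, using only the
two Triad figures above (the variable names are illustrative):

```python
# Stream Triad figures quoted above, in MB/s per core.
triad_one_core = 360.0
triad_six_cores_each = 340.0
cores = 6

# Aggregate Triad bandwidth for one node with all cores active.
node_total = triad_six_cores_each * cores

# Per-core loss relative to running alone, and overall scaling
# efficiency relative to perfect six-way scaling.
per_core_loss = 1.0 - triad_six_cores_each / triad_one_core
scaling_efficiency = node_total / (triad_one_core * cores)

print(node_total)                       # 2040.0
print(round(per_core_loss * 100, 1))    # 5.6
print(round(scaling_efficiency, 3))     # 0.944
```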

The memory controllers themselves have enough bandwidth to supply all
six cores, the DMA engine running the interconnect, and the PCI Express
on I/O nodes, all at once.  (At SC07 we measured 1100 MB/sec to a
Myricom 10G running MX.)

The main memory latency is about 104 nanoseconds, load to use, so the
number of clock cycles to main memory is quite low.
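
To put that latency in cycles, and to relate it to the single
outstanding miss mentioned above, here is a minimal sketch.  The clock
rate and the cache line sizes are NOT stated in this message; the
500 MHz and the 32/64-byte values below are assumptions for
illustration only.  The second function is just the usual
"one line per latency" bound on load-miss traffic, so it is not
directly comparable to a Triad figure, which also counts stores.

```python
def cycles_to_memory(latency_ns, clock_mhz):
    """Memory latency expressed in core clock cycles."""
    return latency_ns * clock_mhz / 1000.0

def single_miss_bandwidth_mb_s(line_bytes, latency_ns):
    """Upper bound on load-miss bandwidth with one outstanding miss:
    at most one cache line arrives per memory latency."""
    return line_bytes / latency_ns * 1000.0   # bytes/ns -> MB/s

LATENCY_NS = 104.0     # from the text
CLOCK_MHZ = 500.0      # ASSUMPTION, not stated in the text
LINE_SIZES = (32, 64)  # ASSUMPTION, hypothetical line sizes in bytes

print(cycles_to_memory(LATENCY_NS, CLOCK_MHZ))   # 52.0
for line_bytes in LINE_SIZES:
    bound = single_miss_bandwidth_mb_s(line_bytes, LATENCY_NS)
    print(line_bytes, round(bound, 1))
```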

As a consequence of this balance of moderate-speed cores, reasonably
low-latency memory (although not extreme bandwidth), and quite fast
communications, benchmarks like HPCC Random Access run very well.
The SC5832, for example, measures around 2.25 GUPS using the Sandia
Labs version of the code, putting it sixth in current rankings behind
the big BlueGene, the big XT3s, a Cray X1, and the Intel Endeavor
cluster.  Cost and power consumption comparisons are left as
an exercise for the reader.
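
One possible starting point for the power side of that exercise, as a
sketch only.  It uses the 12 watts per node and the Triad figure from
this message, and assumes six cores per node on every model (stated
here only for the SC648).  It counts node power alone, with nothing
for power supplies, cooling, or storage, and the function names are
made up for illustration.

```python
NODE_WATTS = 12.0             # per node, incl. 4 GB memory and interconnect
CORES_PER_NODE = 6            # ASSUMPTION for models other than the SC648
TRIAD_MB_S_PER_CORE = 340.0   # Triad with all six cores active

def nodes(total_cores):
    """Number of nodes for a machine with the given core count."""
    return total_cores // CORES_PER_NODE

def node_power_kw(total_cores):
    """Power drawn by the nodes alone, in kilowatts."""
    return nodes(total_cores) * NODE_WATTS / 1000.0

# Triad memory bandwidth delivered per watt of node power.
triad_mb_s_per_watt = TRIAD_MB_S_PER_CORE * CORES_PER_NODE / NODE_WATTS

print(nodes(648), node_power_kw(648))      # 108 1.296
print(nodes(5832), node_power_kw(5832))    # 972 11.664
print(triad_mb_s_per_watt)                 # 170.0
```

The same per-watt ratio would have to be worked out from measured
numbers for any machine it is compared against.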

-Larry