large clusters and topologies

Steffen Persvold sp at scali.no
Sat Jul 29 00:33:11 PDT 2000


Patrick GEOFFRAY wrote:
> 
> Steffen Persvold wrote:
> 
> > > > Any alternatives to Myrinet upwards of that?
> > >
> > > Not really. It depends.
> > SCI is a great alternative if you need high network bandwidth.
> > It is also highly scalable.
> 
> "highly scalable" again :-)
> 
> SCI does not seem "highly scalable", either on paper or in reality:
> On paper, it's a shared-bus design and the available switch
> provides 6 ports. A link supports 400 MB/s full duplex, so you can
> put 2 nodes per SCI bus if you want to saturate your PCI 64/66.
> The aggregate bandwidth of the switch is 1.28 GB/s, but 6 ports at
> 2*400 MB/s = 4.7 GB/s, so it's far from being a full crossbar. You
> can cascade up to 12 switches using 2 expansion ports per switch
> (4 user ports per switch available in this case), which means you
> cannot keep an interesting bisection bandwidth between all of the
> nodes of your cluster.
> In reality, the biggest SCI cluster that I know of is in Paderborn:
> 96 nodes. I have talked in Mannheim with people who worked on it,
> and it seems to be run as 3 clusters of 32 nodes.
>
> I can say I have a scalable cluster if I can add machines without
> dramatically decreasing the communication performance between each
> pair of nodes. The "dramatically" part is very fuzzy: if you need
> 10 KB/s of bandwidth between your nodes, OK, you will scale very
> well with Ethernet 10, Fast Ethernet or SCI. If you need 200 MB/s,
> that's dramatically a problem. :-)

OK, let's make things clear about whether SCI is scalable or not, since
some have clearly misunderstood the scalable SCI technology. Please
don't read this as marketing, but rather as a mini mini-paper on
scalable symmetric multicubes; the full paper in PDF form can be found
at: http://www.scali.com/whitepaper/scieurope98/scale_paper.pdf.

First of all, you should know that scalable SCI clusters don't use
switches, but are connected in a torus topology. This is possible
because the adapters can switch traffic themselves between several link
controllers (LCs). In fact, the 6-port SCI switch is basically 6 LCs
connected together by the B-Link (the backside link for the SCI link
controllers). Thus you don't need more than one adapter on the PCI bus;
just plug an add-on card (mezzanine) onto the adapter, and you have a
torus topology instead of a single ringlet. Up to two mezzanines can be
connected (3D).
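
As a small illustration of why no central switch is needed, here is a
sketch (illustrative only, not Scali's actual routing code) of how a
packet can reach any node in a 2D torus of unidirectional ringlets with
simple dimension-order routing; the change of dimension is exactly the
point where the adapter forwards the packet from one LC to the other
over the B-Link:

# Illustration only: dimension-order routing on a 2D torus of
# unidirectional ringlets (not Scali's routing implementation).

def route_2d_torus(src, dst, nx, ny):
    """Hop-by-hop path from node src to node dst on an nx-by-ny torus.

    Nodes are numbered row-major. The packet first travels along its
    horizontal ringlet, then along the vertical ringlet of the target
    column; the adapter at the corner node forwards it between its two
    link controllers (over the B-Link).
    """
    x, y = src % nx, src // nx
    dx, dy = dst % nx, dst // nx
    path = [(x, y)]
    while x != dx:                  # hops on the horizontal ringlet
        x = (x + 1) % nx
        path.append((x, y))
    while y != dy:                  # hops on the vertical ringlet
        y = (y + 1) % ny
        path.append((x, y))
    return path

# Example: 16 nodes as a 4x4 torus; worst case is (nx-1)+(ny-1) ring hops.
print(route_2d_torus(1, 14, 4, 4))  # [(1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]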

SCI adapters are available in 32-bit/33MHz and 64-bit/33-66MHz versions.
Bus bandwidth on the B-Link is 400 MByte/s, while the LC link is 500
MByte/s (800 MHz on the 64-bit/66MHz version). This evaluation uses the
32-bit/33MHz adapter.

To evaluate scalability, we will see how the different parts of the
interconnect (the B-Link and the LC link) restrict the bandwidth in the
case where all nodes communicate with all other nodes. The restriction
in interconnect bandwidth which stems from the PCI bus is independent of
the size and topology of the system. This is because the PCI bus only
restricts the rate at which the interconnect can deliver packets to a
node and the rate at which the node itself emits packets to the
interconnect.

Bandwidth calculation:
[See paper for detailed information on formulas]

For systems with 6 or fewer nodes, a single ringlet is the best choice.
The bandwidth is restricted to 100 MByte/s.

For 64 nodes the bandwidth is restricted to 77 MByte/s per node in a
2D torus, whereas a 3D torus of the same size limits the bandwidth to
61 MByte/s per node. Clearly, a 2D torus is the best choice here. As
the number of nodes passes 100, the 3D torus becomes the best choice,
and this holds true for up to 1000+ nodes. A 3D torus of 1000 nodes
will sustain a bandwidth of 54 MByte/s per node.
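
The detailed formulas are in the paper (and in the Excel sheet mentioned
below). Just to show the structure of the calculation, here is a small
Python sketch of a deliberately simplified model. The 500/400 MByte/s
link and B-Link numbers are the ones above; the traffic model (uniform
all-to-all, dimension-order routing on unidirectional rings, no SCI
protocol overhead) is a simplification of mine, so the absolute numbers
differ from the figures quoted, but the ringlet/2D/3D crossover shows
the same qualitative behaviour:

# Simplified, illustrative model only -- NOT the exact formulas from
# the paper.  It reproduces the qualitative behaviour (2D beats 3D at
# 64 nodes, 3D wins past roughly 100 nodes), but the absolute numbers
# differ from the figures quoted in the text above.

LC_LINK_BW = 500.0   # MByte/s per LC link (32-bit/33MHz generation)
BLINK_BW   = 400.0   # MByte/s on the adapter's B-Link (a shared bus)

def per_node_limit(dims):
    """Upper bound on sustainable per-node bandwidth b for a torus whose
    rings have the sizes given in dims, e.g. (8, 8) for an 8x8 2D torus.
    """
    # Ring links: with every node injecting b spread uniformly over all
    # destinations, a packet travels (k-1)/2 hops on average in a ring
    # of size k, and there is one outgoing link per node per dimension,
    # so each link carries about b*(k-1)/2.
    ring_cap = min(2.0 * LC_LINK_BW / (k - 1) for k in dims if k > 1)
    # B-Link: every packet crosses the B-Link at the source, at the
    # destination, and roughly once per change of dimension, i.e. about
    # (number of dimensions + 1) times in total, spread over all nodes.
    blink_cap = BLINK_BW / (len(dims) + 1)
    return min(ring_cap, blink_cap)

for label, dims in [("6-node ringlet   ", (6,)),
                    ("64 nodes, 2D 8x8 ", (8, 8)),
                    ("64 nodes, 3D 4^3 ", (4, 4, 4)),
                    ("1024 nodes, 2D   ", (32, 32)),
                    ("1000 nodes, 3D   ", (10, 10, 10))]:
    print("%s ~ %5.0f MByte/s per node" % (label, per_node_limit(dims)))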

The PCI bus (Intel 440BX chipset) used in these tests had an efficiency
of about 70% on write burst cycles to/from the SCI/PCI adapter board.
Since the PCI bus will be burdened with traffic in both directions, PCI
will restrict the available bandwidth per node to half of the sustained
peak in one direction, i.e. roughly 46 MByte/s.
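
(To spell out the arithmetic, assuming the nominal 133 MByte/s peak of a
32-bit/33MHz PCI bus: 0.70 * 133 MByte/s is about 93 MByte/s sustained
in one direction, and since send and receive traffic share the bus, each
node gets roughly half of that, i.e. the ~46 MByte/s above.)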

CONCLUSION: As long as the bandwidth provided by the SCI interconnect is
higher than the one provided by PCI, the scalability in terms of
bandwidth is linear up to 1700 nodes (assuming a 3D torus).

An Excel worksheet (72 kB) with all the formulas and nice graphs showing
the crossing points for the different topologies is also available on
request. (I'd like to see this for other interconnects as well, so that
we can _really_ compare them.)

Finally: if anyone feels offended by getting this information, I am
sorry. This was only meant to enlighten those of you who thought SCI
clusters didn't scale.

Best Regards,
-- 
  Steffen Persvold               Systems Engineer
  Email : mailto:sp at scali.no     Scali AS (http://www.scali.com)
  Tlf   : (+47) 22 62 89 50      Olaf Helsets vei 6
  Fax   : (+47) 22 62 89 51      N-0621 Oslo, Norway



