Kidger's comments on Quadrics' design and performance

Tue Apr 23 01:01:18 PDT 2002

Richard Fryer wrote:
> 
> On Fri, 19 Apr 2002 14:06:00 +0100
> Daniel Kidger <Daniel.Kidger at quadrics.com> wrote:
> 
> > after all as well as having the fastest line-speed, the Quadrics
> > interconnect sends all data as virtual addresses (the NIC has its
> > own MMU and TLB). That way any process can read and write
> > the memory of any other node without any CPU overhead.
> 
> I appreciate getting a bit of technical detail on Quadrics interfaces.  Is
> there a web location that might provide more information - comparative
> benchmarks or protocol information or ???

Of course www.quadrics.com, and Fabrizio Petrini is doing a lot of
evaluation work (http://www.c3.lanl.gov/~fabrizio, esp.
http://www.c3.lanl.gov/~fabrizio/quadrics.html).

> This message also reminded me to ask if a long-held opinion is valid - and
> that opinion is "that a cache coherent interconnect would offer performance
> enhancement when applications are at the 'more tightly coupled' end of the
> spectrum."  I know that present PCI based interfaces can't do that without
> invoking software overhead and latencies.  Anyone have data - or an argument
> for invalidating this opinion?

You would need a programming model other than MPI for that (see below),
maybe OpenMP, as you basically have the characteristics of an SMP system
with a cc-NUMA architecture.

> I did recently read that the AMD 'HyperTransport' interfaces ARE capable of
> cache coherent transactions.  This would appear to allow protocols (such as
> SCI) that support cache coherence to operate in that mode.  But I wonder if
> it matters to the MPI world.  Seems to me that it would be a factor in
> improving scalability (provided that other interconnect issues, such as
> bandwidth bottlenecks, don't prevent it).  My recollection is that the SCI
> simulations I saw required very little added traffic to maintain coherency.

This is true (for an introduction, see
http://www.SCIzzL.com/HowSCIcohWorks.html).

However, for MPI, cache coherence would not really add a performance
benefit. MPI is designed to be efficient with "write-only" protocols.
One-sided communication may benefit from it, but other techniques like
Cray SHMEM achieve the same without cache coherence.
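
To illustrate what a "write-only" protocol can look like, here is a toy
sketch in Python. It assumes that "write-only" means the sender pushes
both payload and a ready flag into the receiver's memory, while the
receiver only polls its own local memory. The names Node, remote_write,
send and try_recv are hypothetical and are not part of SCI, SISCI or
MPI; the model ignores write ordering and buffer management on real
hardware.

```python
# Toy model of a "write-only" message transfer between two nodes.
# All names are hypothetical; this is not an SCI, SISCI or MPI API.

FLAG = 0      # byte 0 holds the message length; 0 means "buffer empty"
PAYLOAD = 1   # payload bytes start at offset 1


class Node:
    def __init__(self, name, size=64):
        self.name = name
        self.mem = bytearray(size)  # memory exported to other nodes
        self.remote_reads = 0       # remote loads issued by this node
        self.remote_writes = 0      # remote stores issued by this node

    def remote_write(self, target, offset, data):
        """Store data into another node's exported memory."""
        target.mem[offset:offset + len(data)] = data
        self.remote_writes += 1


def send(sender, receiver, payload):
    # Write the payload first and the flag last, so that a set flag
    # always implies a complete payload in this simplified model.
    sender.remote_write(receiver, PAYLOAD, payload)
    sender.remote_write(receiver, FLAG, bytes([len(payload)]))


def try_recv(receiver):
    # The receiver touches only its own local memory.
    length = receiver.mem[FLAG]
    if length == 0:
        return None
    data = bytes(receiver.mem[PAYLOAD:PAYLOAD + length])
    receiver.mem[FLAG] = 0  # mark the buffer as free again
    return data


src, dst = Node("src"), Node("dst")
send(src, dst, b"hello")
msg = try_recv(dst)
print(msg, src.remote_writes, src.remote_reads, dst.remote_reads)
```

In this sketch neither node ever issues a remote read, so nothing in the
exchange depends on one node observing the cache state of the other.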

And I do not expect anybody except AMD or chipset designers to design
network adapters / bus bridges for something proprietary like
HyperTransport...

> Also a brief note about the Dolphin product line, since the issue of link
> saturation has come up:  - they DO also sell switches - or at least offer
> them.  And if you check the SCI specification, you'll see that there are
> some elaborate discussions of fabric architectures that the protocol
> supports and switches enable.  What I DO NOT know is if the SCALI software
> supports switch-based operation, and also don't know what the impact is on
> the system cost per node.  My 'inexperienced' assessment of the appeal in
> the Dolphin family is that you can start without the switch and later add it
> if the performance benefit warrants it.  That's what I'd say if I were selling
> them anyway - and didn't know otherwise.  :-)

The "external" switches are not designed for large-scale HPC
applications (although they scale quite well within their supported
number of nodes), but for high-performance, high-availability
small-scale cluster or embedded applications, such as the systems Sun
sells. With external switches, you don't have to do anything to keep the
network up if a node fails (and nothing when it comes back either, as
SCI is not source-routed). In torus topologies, re-routing needs to be
applied to bypass bad nodes (Scali does this on the fly).

Scali does not support external switches AFAIK (or at least no longer
sells such systems), which is less a technical issue than a design
decision, as the topology is fully transparent to the nodes accessing
the network (they did use switches in the past, see
http://www.scali.com/whitepaper/ehpc97/slide_9.html).

For large-scale applications, distributed switches as in torus
topologies scale better and are more cost-efficient (see
http://www.scali.com/whitepaper/scieurope98/scale_paper.pdf and other
resources). With switches, you need *a lot* of cables and switches
(which doesn't stop Quadrics from doing so, resulting in an impressive
14 miles of cables for a recent system (IIRC), with single cables being
up to 25m in length). It would need to be verified whether such a system
built with a Quadrics-like fat-tree topology using Dolphin's 8-port
switches would scale better than the equivalent torus topology for
different communication patterns. I doubt it. In any case, the
interconnect would cost a lot more (at least twice as much, or even more
depending on the dimension of the tree).
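
As a rough illustration of the parts count, the following Python sketch
compares a torus with a fat tree. It is only a sketch under stated
assumptions: the fat tree is modelled as a k-ary n-tree (k**n nodes,
n levels of k**(n-1) switches, each switch with k down and k up ports,
so k = 4 for an 8-port switch), and the torus is modelled as a k-ary
n-dimensional torus with one cable per ring segment and no external
switches. The function names are hypothetical, and the numbers say
nothing about prices, cable lengths or performance.

```python
def torus_parts(k, n):
    """k-ary n-dimensional torus: k**n nodes, each node feeds one
    ring segment per dimension, and no external switches are used."""
    nodes = k ** n
    return {"nodes": nodes, "switches": 0, "cables": n * nodes}


def fat_tree_parts(k, n):
    """k-ary n-tree: k**n nodes and n levels of k**(n-1) switches,
    each switch having k down ports and k up ports."""
    nodes = k ** n
    switches = n * k ** (n - 1)
    # One cable per node to the first level, plus k up-links for every
    # switch below the top level: nodes + (n - 1) * nodes in total.
    cables = nodes + (n - 1) * nodes
    return {"nodes": nodes, "switches": switches, "cables": cables}


# 64 nodes as an 8x8 torus versus a tree of 8-port (k = 4) switches.
torus = torus_parts(8, 2)
tree = fat_tree_parts(4, 3)
print("torus:   ", torus)
print("fat tree:", tree)
```

For 64 nodes this model gives 128 cables and no switches for the 8x8
torus, against 192 cables and 48 switches for the three-level tree. The
up ports of the top-level switches stay unused in this model.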

SCI-MPICH can be used with arbitrary SCI topologies (because it uses
the SISCI interface and thus runs with Scali or Dolphin SCI drivers). It
is not as closely coupled to the SCI drivers as ScaMPI is.

 Joachim

-- 
|  _  RWTH|  Joachim Worringen
|_|_`_    |  Lehrstuhl fuer Betriebssysteme, RWTH Aachen
  | |_)(_`|  http://www.lfbs.rwth-aachen.de/~joachim
    |_)._)|  fon: ++49-241-80.27609 fax: ++49-241-80.22339