[Beowulf] IB in the real world

Josh England jjengla at sandia.gov
Thu May 12 16:58:44 PDT 2005

On Thu, 2005-05-12 at 14:32 -0700, Bill Broadley wrote:
> I've been looking at the high performance interconnect options.
> As one might expect every vendor sells certain strengths and accuses
> the competition of certain weaknesses.  I can't think of a better place
> to discuss these things.  The Beowulf list seems mostly vendor neutral,
> er at least peer reviewed, and hopefully some end users actually using
> the technology can provide some real world/end user perspectives.
> So questions that come to my mind (but please feel free to add more):
> 1.  How good is the OpenIB mapper?

That's Myrinet talk.  I assume you're talking about the Subnet Manager
(OpenSM)?  Short answer: it's good enough, but could certainly use some

>   It periodically generates static
>     routing tables maps of available IB nodes?
>   It's the critical piece
>     for handling adding a node or removing a node and keeping a cluster
>     functioning?  Reliable?

You need an SM for the IB fabric to work, yes.  AFAIK, you shouldn't
start seeing any issues with the SM until you get up into the 1000+
node count.

> 2.  How good is the OpenIB+MPI stack(s)?  Any reliable enough for large 
>     month long jobs?

No MPI has been ported to the OpenIB stack yet.  The verbs
implementation was just completed a couple months ago.  I believe a few
efforts are currently underway.  If you want MPI, you're stuck with a
vendor's IB stack for a little while yet.

>   Which?  I've heard rumors of large IB clusters that 
>     never met the acceptance criteria.  FUD or real?  Related to IB 
>     reliability or performance?
> 3.  How good are the mappers that run inside various managed switches?
>     Reliable?  Same code base?  Better or worse than the OpenIB mapper?

They work.

> 4.  IB requires pinned memory per node that increases with the total
>     node count, true?

Host memory or HCA memory?
This is true, especially for the connection-based protocols.  Newer VAPI
implementations have addressed this by implementing a shared receive
queue that the MPI can take advantage of (MVAPICH-0.9.5 does) to reduce
memory consumption.  I'm pretty sure OpenIB has shared receive queues as
well.  Still, the memory consumption won't hurt too badly until you start
hitting 1000+ processes in a single job.

>   In all cases?  Exactly what is the formula for memory
>     overhead?  It is per node?  IB card?  Per CPU?  Is the pinned memory     
>     optional?  What are the performance implications of not having it?

Per connection.  Pinned memory is not optional -- you need to open up
send/receive queues for transferring data.  You can use the UD
(Unreliable Datagram) transport to reduce memory consumption, although
it is currently slower than RC (Reliable Connected).
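As a rough illustration of the per-connection scaling, the sketch below
estimates pinned receive-buffer memory for one process in a fully
connected RC job, with and without a shared receive queue.  The buffer
counts and buffer size are made-up placeholders, not measured values
from any IB stack or MPI, and the function names are hypothetical.

```python
def rc_recv_memory_bytes(num_procs, bufs_per_conn, buf_size):
    """Pinned receive memory for one process when every peer gets
    its own receive queue (one RC connection per peer)."""
    peers = num_procs - 1
    return peers * bufs_per_conn * buf_size


def srq_recv_memory_bytes(srq_bufs, buf_size):
    """Pinned receive memory for one process when all connections
    draw from a single shared receive queue."""
    return srq_bufs * buf_size


if __name__ == "__main__":
    # All three values are assumptions chosen only for illustration.
    BUF_SIZE = 8 * 1024      # bytes per receive buffer
    BUFS_PER_CONN = 16       # buffers posted per RC connection
    SRQ_BUFS = 512           # buffers posted to the shared queue

    for n in (64, 1024, 4096):
        per_conn = rc_recv_memory_bytes(n, BUFS_PER_CONN, BUF_SIZE)
        shared = srq_recv_memory_bytes(SRQ_BUFS, BUF_SIZE)
        print(f"{n:5d} procs: per-connection {per_conn / 2**20:8.1f} MiB, "
              f"shared queue {shared / 2**20:6.1f} MiB")
```

The point is only the shape of the curves: the per-connection figure
grows linearly with the job size, while the shared-queue figure does not
depend on it.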

> 5.  Routing is static?

Yes.  I think some (proprietary?) subnet managers may be capable of
assigning multiple LIDs to each HCA end point and deriving multiple paths
between each pair of end points.  I don't think OpenSM currently does
this.  An ideal solution would be to have a switch/SM capable of adaptive
dispersive routing.

>   Is there flow control?  Any handling of hot spots?
>     How are trunked lines load balanced (i.e. 6 IB ports used as an uplink
>     for a 24 port switch).  Load balancing across uplinks?  Arbitrary 
>     topology (rings?  tree only? mesh?)  Static mapping between downlinks 
>     and uplinks (no load balancing)?  Cut through or store and forward? 
>     Both? When? Backpressure?

The subnet manager typically does a good enough job of balancing the
routes, but they are still static.  The topology can be arbitrary, but
fat trees can provide better performance.
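To make the static-routing point concrete, here is a small sketch for
the 24-port switch with 6 uplinks from the question.  The modulo rule is
a hypothetical stand-in for whatever a real subnet manager computes (it
is not OpenSM's algorithm); it only shows that a fixed
destination-to-uplink mapping balances well for some traffic and badly
for other traffic.

```python
from collections import Counter


def oversubscription(total_ports, uplinks):
    """Ratio of downlinks to uplinks on a single leaf switch."""
    return (total_ports - uplinks) / uplinks


def static_uplink(dest_lid, num_uplinks):
    """Hypothetical static rule: a given destination LID always
    leaves through the same uplink."""
    return dest_lid % num_uplinks


def uplink_load(dest_lids, num_uplinks):
    """Count how many destinations are routed over each uplink."""
    load = Counter(static_uplink(d, num_uplinks) for d in dest_lids)
    return [load.get(u, 0) for u in range(num_uplinks)]


if __name__ == "__main__":
    print("oversubscription:", oversubscription(24, 6))

    # 18 destinations with consecutive LIDs spread evenly.
    print("even pattern:   ", uplink_load(range(100, 118), 6))

    # 18 destinations whose LIDs happen to collide on one uplink.
    print("unlucky pattern:", uplink_load([6 * i for i in range(1, 19)], 6))
```

With static routes the second pattern stays on one uplink no matter how
congested it gets, which is the hot-spot concern behind the question.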

> 6.  What real world latencies and bandwidths are you observing on production
>     clusters with MPI?  How much does that change when all nodes are running
>     the latency or bandwidth benchmark?

I don't have the numbers offhand, but I recall about 7-9us latency and
~=800MB/s on PCI-X and ~=1200MB/s on PCIe.

> 7.  Using the top500 numbers to measure efficiency what would be a
>     a good measure of interconnect efficiency?  Specifically RMax/RPeak     
>     for a given similar size cluster?
> 8.  Are there more current HPC Challenge numbers than
>     http://icl.cs.utk.edu/hpcc/hpcc_results.cgi?  Are these benchmark
>     results included in all top500 submissions?  It seems like a good place
>     to measure latency/bandwidth and any relation to cluster size.
> 9.  Most (all?) IB switches have Mellanox 24*4x chips in them?  What is
>     the actual switch capacity of the chip?  20GBit*24?  Assuming a 
>     particular clock speed?  Do switches run that clock speed?  4x SDR 
>     per link?  DDR?

No DDR yet, but soon.
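For the capacity arithmetic in the question, the sketch below works from
the nominal IB figures: 2.5 Gbit/s signaling per lane at SDR, doubled
for DDR, with 8b/10b encoding.  It computes only the theoretical
aggregate for a 24-port 4x device; whether a particular switch chip
actually sustains that is not something this thread establishes.

```python
SDR_LANE_GBIT = 2.5          # nominal SDR signaling rate per lane
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b line encoding


def link_rates(lanes, rate_multiplier=1):
    """Return (signaling, data) rate in Gbit/s for one direction of a
    link.  rate_multiplier is 1 for SDR and 2 for DDR."""
    signaling = lanes * SDR_LANE_GBIT * rate_multiplier
    return signaling, signaling * ENCODING_EFFICIENCY


def aggregate_rates(ports, lanes, rate_multiplier=1):
    """Return (signaling, data) Gbit/s summed over all ports and both
    directions."""
    signaling, data = link_rates(lanes, rate_multiplier)
    return 2 * ports * signaling, 2 * ports * data


if __name__ == "__main__":
    for name, mult in (("SDR", 1), ("DDR", 2)):
        sig, data = link_rates(4, mult)
        agg_sig, agg_data = aggregate_rates(24, 4, mult)
        print(f"4x {name}: {sig:.0f} Gbit/s signaling, {data:.0f} Gbit/s data "
              f"per direction; 24 ports: {agg_sig:.0f} / {agg_data:.0f} Gbit/s")
```

On these numbers the "20GBit" per port in the question matches a 4x SDR
link counted in both directions (2 x 10 Gbit/s signaling).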


> I'd be happy to summarize responses, or just track the discussion on
> the list.  I'm of course interested in similar for quadrics, Myrinet, and
> any other competitors in the Beowulf interconnect space.  Although maybe
> that should be delayed for a week each.  Does anyone know of a better
> place to ask such things and get a vendor neutral response (or at least
> responses that are subject to peer review)?
> Material sent to me directly, NOT covered by NDA, can be included in
> my summary anonymously by request.
> -- 
> Bill Broadley
> Computational Science and Engineering
> UC Davis
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
