[Beowulf] IB in the real world

Josh England jjengla at sandia.gov
Thu May 12 16:58:44 PDT 2005


On Thu, 2005-05-12 at 14:32 -0700, Bill Broadley wrote:
> I've been looking at the high performance interconnect options.
> As one might expect every vendor sells certain strengths and accuses
> the competition of certain weaknesses.  I can't think of a better place
> to discuss these things.  The Beowulf list seems mostly vendor neutral,
> er at least peer reviewed, and hopefully some end users actually using
> the technology can provide some real world/end user perspectives.
> 
> So questions that come to my mind (but please feel free to add more):
>                                                                                 
> 1.  How good is the OpenIB mapper?

That's Myrinet talk.  I assume you're talking about the Subnet Manager
(OpenSM)?  Short answer: it's good enough, but could certainly use some
improvement.

>   It periodically generates static
>     routing tables maps of available IB nodes?
>   It's the critical piece
>     for handling adding a node or removing a node and keeping a cluster
>     functioning?  Reliable?

You need an SM for the IB fabric to work, yes.  AFAIK, you shouldn't
start seeing any issues with the SM until you get up into the 1000+
node range.

>                                                                                 
> 2.  How good is the OpenIB+MPI stack(s)?  Any reliable enough for large 
>     month long jobs?

No MPI has been ported to the OpenIB stack yet.  The verbs
implementation was just completed a couple months ago.  I believe a few
efforts are currently underway.  If you want MPI, you're stuck with a
vendor's IB stack for a little while yet.

>   Which?  I've heard rumors of large IB clusters that 
>     never met the acceptance criteria.  FUD or real?  Related to IB 
>     reliability or performance?
>                                                                                 
> 3.  How good are the mappers that run inside various managed switches?
>     Reliable?  Same code base?  Better or worse than the OpenIB mapper?

They work.

>                                                                                 
> 4.  IB requires pinned memory per node that increases with the total
>     node count, true?

Host memory or HCA memory?
This is true, especially for the connection-based protocols.  Newer VAPI
implementations have addressed this by implementing a shared receive
queue that the MPI can take advantage of (Mvapich-0.9.5 does) to reduce
memory consumption.  I'm pretty sure OpenIB has shared receive queues as
well.  Still, the memory consumption won't hurt too badly until you start
hitting 1000+ processes in a single job.

>   In all cases?  Exactly what is the formula for memory
>     overhead?  It is per node?  IB card?  Per CPU?  Is the pinned memory     
>     optional?  What are the performance implications of not having it?

Per connection.  Pinned memory is not optional -- you need to open up
send/receive queues for transferring data.  You can use the UD protocol
to reduce memory consumption although it is currently slower than RC.
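
To make the per-connection scaling concrete, here is a back-of-the-envelope
sketch in Python.  The buffer count, buffer size, and shared pool size are
made-up placeholder values, not figures from VAPI, Mvapich, or OpenIB; the
function names are hypothetical too.  The only point is the shape of the
curves: receive buffers posted per connection grow linearly with the number
of peers, while a shared receive queue keeps the pool at a fixed size.

```python
def rc_recv_memory_bytes(num_peers, bufs_per_conn=32, buf_size=8192):
    """Pinned receive memory when every connection posts its own buffers.

    bufs_per_conn and buf_size are illustrative assumptions only.
    """
    return num_peers * bufs_per_conn * buf_size


def srq_recv_memory_bytes(num_peers, shared_bufs=512, buf_size=8192):
    """Pinned receive memory with one shared pool of buffers.

    num_peers is accepted only to keep the two signatures parallel;
    in this simplified model the pool size does not depend on it.
    """
    return shared_bufs * buf_size


if __name__ == "__main__":
    for procs in (64, 256, 1024, 4096):
        peers = procs - 1  # fully connected job: one connection per peer
        rc_mb = rc_recv_memory_bytes(peers) / 2**20
        srq_mb = srq_recv_memory_bytes(peers) / 2**20
        print(f"{procs:5d} procs: per-connection {rc_mb:8.1f} MB, "
              f"shared queue {srq_mb:6.1f} MB (per process)")
```

The model ignores send queues, completion queues, and connection state,
which also cost memory per connection, so treat it as a lower bound on the
trend rather than a sizing tool.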

>                                                                                 
> 5.  Routing is static?

Yes.  I think some (proprietary?) subnet managers may be capable of
assigning multiple LIDs to each HCA end point and deriving multiple paths
between each pair of end points.  I don't think OpenSM currently does
this.  An ideal solution would be to have a switch/SM capable of adaptive
dispersive routing.
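
For reference, the mechanism behind multiple LIDs per end point is the
port's LMC (LID Mask Control) value: a port with LMC = n answers to 2^n
consecutive LIDs starting at an aligned base LID, and the SM can program a
different static route for each one.  A minimal Python sketch follows.  The
function names and the rank-based spreading policy in pick_dlid are
hypothetical illustrations, not what any particular SM or MPI does.

```python
def lids_for_port(base_lid, lmc):
    """Return every LID a port answers to for a given base LID and LMC."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC is a 3-bit field (0..7)")
    if base_lid & ((1 << lmc) - 1):
        raise ValueError("base LID must have its low LMC bits clear")
    return list(range(base_lid, base_lid + (1 << lmc)))


def pick_dlid(base_lid, lmc, src_rank):
    """Hypothetical policy: spread senders across the destination's LIDs.

    Each destination LID may follow a different static route, so senders
    that pick different LIDs may avoid sharing the same links.
    """
    return base_lid + (src_rank % (1 << lmc))


if __name__ == "__main__":
    print(lids_for_port(16, 2))
    print([pick_dlid(16, 2, rank) for rank in range(6)])
```

The routes are still static; this only gives the sender a fixed set of
alternatives to choose from, which is different from adaptive routing in
the switch.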

>   Is there flow control?  Any handling of hot spots?
>     How are trunked lines load balanced (i.e. 6 IB ports used as an uplink
>     for a 24 port switch).  Load balancing across uplinks?  Arbitrary 
>     topology (rings?  tree only? mesh?)  Static mapping between downlinks 
>     and uplinks (no load balancing)?  Cut through or store and forward? 
>     Both? When? Backpressure?


The subnet manager typically does a good enough job of balancing the
routes, but they are still static.  The topology can be arbitrary, but
fat trees can provide better performance.

>                                                                                 
> 6.  What real world latencies and bandwidths are you observing on production
>     clusters with MPI?  How much does that change when all nodes are running
>     the latency or bandwidth benchmark?

I don't have the numbers offhand, but I recall about 7-9us latency and
~800MB/s on PCI-X and ~1200MB/s on PCIe.
                                                                                
> 7.  Using the top500 numbers to measure efficiency what would be a
>     a good measure of interconnect efficiency?  Specifically RMax/RPeak     
>     for a given similar size cluster?
>                                                                                 
> 8.  Are there more current HPC Challenge numbers than
>     http://icl.cs.utk.edu/hpcc/hpcc_results.cgi?  Are these benchmark
>     results included in all top500 submissions?  It seems like a good place
>     to measure latency/bandwidth and any relation to cluster size.
> 
> 9.  Most (all?) IB switches have Mellanox 24*4x chips in them?  What is
>     the actual switch capacity of the chip?  20GBit*24?  Assuming a 
>     particular clock speed?  Do switches run that clock speed?  4x SDR 
>     per link?  DDR?

No DDR yet, but soon.
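
On the "20GBit*24?" arithmetic, a small Python sketch of the nominal link
rates may help.  It uses the standard per-lane signaling rates (2.5 Gbit/s
SDR, 5 Gbit/s DDR) and 8b/10b encoding; the function names are hypothetical,
and the aggregate figure is just ports times link rate times two directions,
not a measured or vendor-stated switching capacity.

```python
# Nominal per-lane signaling rate in Gbit/s.
LANE_GBIT = {"SDR": 2.5, "DDR": 5.0}


def link_signal_gbit(width, speed):
    """Signaling rate of one link, one direction (width = lane count)."""
    return width * LANE_GBIT[speed]


def link_data_gbit(width, speed):
    """Data rate after 8b/10b encoding (8 data bits per 10 line bits)."""
    return link_signal_gbit(width, speed) * 8 / 10


def switch_aggregate_gbit(ports, width, speed):
    """Sum of signaling rates over all ports, counting both directions."""
    return ports * 2 * link_signal_gbit(width, speed)


if __name__ == "__main__":
    for speed in ("SDR", "DDR"):
        print(speed,
              "4x link:", link_signal_gbit(4, speed), "Gbit/s signaling,",
              link_data_gbit(4, speed), "Gbit/s data;",
              "24 ports:", switch_aggregate_gbit(24, 4, speed), "Gbit/s")
```

So "20GBit" per port is the bidirectional signaling rate of a 4x SDR link;
the usable data rate is 8 Gbit/s (about 1 GB/s) in each direction.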

-JE

> 
> I'd be happy to summarize responses, or just track the discussion on
> the list.  I'm of course interested in similar for quadrics, Myrinet, and
> any other competitors in the Beowulf interconnect space.  Although maybe
> that should be delayed for a week each.  Does anyone know of a better
> place to ask such things and get a vendor neutral response (or at least
> responses that are subject to peer review)?
> 
> Material sent to me directly, NOT covered by NDA, can be included in
> my summary anonymously by request.
> 
> -- 
> Bill Broadley
> Computational Science and Engineering
> UC Davis