[Beowulf] IB in the real world

Bill Broadley bill at cse.ucdavis.edu
Thu May 12 14:32:58 PDT 2005

I've been looking at the high performance interconnect options.
As one might expect, every vendor sells certain strengths and accuses
the competition of certain weaknesses.  I can't think of a better place
to discuss these things.  The Beowulf list seems mostly vendor neutral,
or at least peer reviewed, and hopefully some end users actually running
the technology can provide some real-world/end-user perspectives.

So questions that come to my mind (but please feel free to add more):
1.  How good is the OpenIB mapper?  It periodically generates static
    routing table maps of the available IB nodes?  It's the critical piece
    for handling node additions and removals and keeping a cluster
    functioning?  Reliable?
2.  How good are the OpenIB+MPI stack(s)?  Are any reliable enough for
    large, month-long jobs?  Which?  I've heard rumors of large IB clusters
    that never met their acceptance criteria.  FUD or real?  Related to IB
    reliability or performance?
3.  How good are the mappers that run inside various managed switches?
    Reliable?  Same code base?  Better or worse than the OpenIB mapper?
4.  IB requires pinned memory per node that increases with the total
    node count, true?  In all cases?  Exactly what is the formula for the
    memory overhead?  Is it per node?  Per IB card?  Per CPU?  Is the pinned
    memory optional?  What are the performance implications of not having it?
5.  Routing is static?  Is there flow control?  Any handling of hot spots?
    How are trunked lines load balanced (e.g. 6 IB ports used as uplinks
    for a 24-port switch)?  Is there load balancing across uplinks, or a
    static mapping between downlinks and uplinks (no load balancing)?
    Arbitrary topology (rings?  tree only?  mesh?)?  Cut through or store
    and forward?  Both?  When?  Backpressure?
6.  What real-world latencies and bandwidths are you observing on production
    clusters with MPI?  How much do they change when all nodes are running
    the latency or bandwidth benchmark?
7.  Using the Top500 numbers to measure efficiency, what would be a
    good measure of interconnect efficiency?  Specifically, Rmax/Rpeak
    for clusters of a similar size?
8.  Are there more current HPC Challenge numbers than
    http://icl.cs.utk.edu/hpcc/hpcc_results.cgi?  Are these benchmark
    results included in all Top500 submissions?  It seems like a good place
    to measure latency/bandwidth and its relation to cluster size.
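On question 4, here's a back-of-the-envelope sketch of how pinned memory
might scale if an MPI stack opens one reliable-connection queue pair per
remote process and pre-pins a fixed buffer pool per connection.  The
per-connection byte count and processes-per-node figure are invented
placeholders, not measured numbers for any real IB stack:

```python
# Hypothetical model: pinned-memory overhead on one node, assuming a
# fully connected reliable-connection (RC) transport with a fixed
# pre-pinned buffer pool per connection.  All constants are illustrative
# assumptions, NOT vendor figures.

def pinned_memory_per_node(total_nodes, procs_per_node=2,
                           bytes_per_connection=256 * 1024):
    """Pinned bytes on one node: every local process holds one pinned
    buffer pool per remote process."""
    remote_procs = (total_nodes - 1) * procs_per_node
    local_procs = procs_per_node
    return local_procs * remote_procs * bytes_per_connection

# If the formula really is linear in node count, roughly 8x the nodes
# means roughly 8x the pinned memory on *every* node:
print(pinned_memory_per_node(128) / 2**20, "MiB at 128 nodes")
print(pinned_memory_per_node(1024) / 2**20, "MiB at 1024 nodes")
```

If the real formula looks anything like this, pinned memory grows
linearly with node count on every node, which is exactly why the
question matters at scale.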
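On question 7, the proposed metric is trivial to compute; the sketch
below just makes it explicit.  The two cluster entries are invented
examples, not actual Top500 submissions:

```python
# Interconnect-efficiency proxy from Top500-style numbers:
#   efficiency = Rmax (achieved HPL Gflops) / Rpeak (theoretical peak).
# The entries below are made-up placeholders, not real results.

def hpl_efficiency(rmax, rpeak):
    return rmax / rpeak

clusters = [
    ("hypothetical-IB-cluster",   1800.0, 2400.0),
    ("hypothetical-GigE-cluster", 1200.0, 2400.0),
]
for name, rmax, rpeak in clusters:
    print(f"{name}: {hpl_efficiency(rmax, rpeak):.0%}")  # 75% vs 50%
```

Comparing that ratio across similar-size clusters with different
interconnects would be one crude way to isolate the interconnect's
contribution.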

9.  Do most (all?) IB switches have Mellanox 24-port 4x chips in them?  What
    is the actual switch capacity of the chip?  20 Gbit * 24?  Assuming a
    particular clock speed?  Do switches actually run at that clock speed?
    4x SDR per link?  DDR?
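To spell out the arithmetic behind question 9: the per-link figures
(4 lanes at 2.5 Gbit/s SDR signaling, 8b/10b encoding) are the standard
InfiniBand rates; whether a given 24-port chip actually switches the
full aggregate is the open question.  A sketch:

```python
# 4x SDR InfiniBand: 4 lanes x 2.5 Gbit/s signaling = 10 Gbit/s per
# direction, i.e. 20 Gbit/s bidirectional per port.  With 8b/10b line
# coding the usable data rate is 8 Gbit/s per direction.  DDR doubles
# the signaling clock.

LANES = 4
SDR_LANE_GBIT = 2.5          # signaling rate per lane, per direction
ENCODING = 8 / 10            # 8b/10b line coding overhead

def port_gbit(ddr=False, bidirectional=True, data_rate=False):
    g = LANES * SDR_LANE_GBIT * (2 if ddr else 1)
    if data_rate:
        g *= ENCODING
    return g * (2 if bidirectional else 1)

# A 24-port chip, counting each port's bidirectional figure:
print(24 * port_gbit())                 # 480.0 Gbit/s signaling
print(24 * port_gbit(data_rate=True))   # 384.0 Gbit/s usable data
```

So "20 Gbit * 24" is the bidirectional signaling-rate aggregate; the
usable data aggregate is lower, and DDR would double both.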

I'd be happy to summarize responses, or just track the discussion on
the list.  I'm of course interested in similar information for Quadrics,
Myrinet, and any other competitors in the Beowulf interconnect space,
although maybe each of those should be delayed for a week.  Does anyone
know of a better place to ask such things and get a vendor-neutral
response (or at least responses that are subject to peer review)?

Material sent to me directly, NOT covered by NDA, can be included in
my summary anonymously by request.

Bill Broadley
Computational Science and Engineering
UC Davis
