[Beowulf] Fabric design consideration

Mon Aug 3 22:29:50 PDT 2009

On Thu, Jul 30, 2009 at 8:18 AM, Smith, Brian<brs at admin.usf.edu> wrote:
> Hi, All,
>
> I've been re-evaluating our existing InfiniBand fabric design for our HPC systems since I've been tasked with determining how we will add more systems in the future as more and more researchers opt to add capacity to our central system.  We've already gotten to the point where we've used up all available ports on the 144 port SilverStorm 9120 chassis that we have and we need to expand capacity.  One option that we've been floating around -- that I'm not particularly fond of, btw -- is to purchase a second chassis and link them together over 24 ports, two per spine.  While a good deal of our workload would be ok with 5:1 blocking and 6 hops (3 across each chassis), I've determined that, for the money, we're definitely not getting the best solution.
>
> The plan that I've put together involves using the SilverStorm as the core in a spine-leaf design.  We'll go ahead and purchase a batch of 24 port QDR switches, two for each rack, to connect our 156 existing nodes (with up to 50 additional on the way).  Each leaf will have 6 links back to the spine for 3:1 blocking and 5 hops (2 for the leafs, 3 for the spine).  This will allow us to scale the fabric out to 432 total nodes before having to purchase another spine switch.  At that point, half of the six uplinks will go to the first spine, half to the second.  In theory, it looks like we can scale this design -- with future plans to migrate to a 288 port chassis -- to quite a large number of nodes.  Also, just to address this up front, we have a very generic workload, with a mix of md, abinitio, cfd, fem, blast, rf, etc.
>
> If the good folks on this list would be kind enough to give me your input regarding these options or possibly propose a third (or fourth) option, I'd very much appreciate it.
>
> Brian Smith

I think hop count is a smaller design issue than cable length for QDR.
Cable length and the physical layout of hosts in the machine room may
prove to be the critical issue in your design.  Also, since routing is
static, some seemingly obvious assumptions about routing, links,
cross-sectional bandwidth and blocking may not hold in practice.
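
To make the static routing point concrete, here is a toy Python sketch.
It assumes each leaf picks its uplink purely from the destination id
(dest modulo number of uplinks).  That rule and the function names are
my own illustration, not what any particular subnet manager does.  It
only shows how a nominal 3:1 leaf can do much worse when several
destinations share one statically assigned uplink.

```python
from collections import Counter


def uplink_for(dest_id, n_uplinks):
    """Toy static route: the uplink is fixed by the destination alone."""
    return dest_id % n_uplinks


def uplink_load(dest_ids, n_uplinks):
    """Count how many simultaneous flows land on each uplink of one leaf."""
    load = Counter(uplink_for(d, n_uplinks) for d in dest_ids)
    return [load.get(u, 0) for u in range(n_uplinks)]


# Six flows leaving a leaf that has six uplinks.
spread = uplink_load([0, 1, 2, 3, 4, 5], 6)       # one flow per uplink
clumped = uplink_load([0, 6, 12, 18, 24, 30], 6)  # every flow on uplink 0
print("spread :", spread)
print("clumped:", clumped)
```

In the second case five of the six uplinks sit idle while one carries
all the traffic, so the paper blocking ratio says little about what
those six flows actually see.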

Also less obvious to a group like this are your storage, job mix and
batch system.

For example, in a single rack with a pair of 24-port QDR switches, you
might wish to have two or three links connecting those two switches
directly at QDR rates.  The remaining three or four links would then
connect (DDR?) back to the 144-port switch.  If the batch system were
'rack aware', jobs that could run within a single rack would do so,
and jobs that had ranks scattered about would see a lightly loaded
central switch.
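
A quick way to compare port budgets for the layouts discussed here is
a few lines of Python.  This is only a sketch: leaf_budget and
max_hosts are names I made up, the ratio counts ports rather than
bandwidth, and it assumes every leaf is wired identically to a single
core switch.

```python
def leaf_budget(ports, uplinks, cross_links=0):
    """Split one leaf switch's ports into hosts, uplinks and in-rack links.

    Returns (host_ports, host_ports / uplinks), i.e. the port-count
    oversubscription toward the core.
    """
    hosts = ports - uplinks - cross_links
    if hosts <= 0 or uplinks <= 0:
        raise ValueError("need at least one host port and one uplink")
    return hosts, hosts / uplinks


def max_hosts(core_ports, ports, uplinks, cross_links=0):
    """Hosts supported when identical leaves fill a single core switch."""
    leaves = core_ports // uplinks
    hosts, _ = leaf_budget(ports, uplinks, cross_links)
    return leaves * hosts


# Plan from the original post: 24-port leaves, 6 uplinks, 144-port core.
print(leaf_budget(24, 6), max_hosts(144, 24, 6))

# Variation above: 3 uplinks to the core plus 3 links to the rack's
# other leaf switch.
print(leaf_budget(24, 3, 3), max_hosts(144, 24, 3, 3))
```

The first case reproduces the 3:1 and 432 node figures.  The second
leaves 18 host ports per leaf at 6:1 toward the core, which is the
price of keeping in-rack traffic off the central switch.  If the
uplinks really do run at DDR while the hosts run at QDR, the bandwidth
ratio would be roughly double the port ratio.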

Adding QDR to the mix as you scale out to 400+ nodes using newer
multi-core nodes could be fun.

When you knock on vendor doors, ask about optical links...  QDR optical
links may let you reach beyond some classic fabric layouts as your
machine room and CPU core count grow.

-- 
        NiftyOMPI
        T o m   M i t c h e l l