[Beowulf] Fabric design consideration
NiftyOMPI Tom Mitchell niftyompi at niftyegg.com
Mon Aug 3 22:29:50 PDT 2009
- Previous message: [Beowulf] Lustre Featured on Podcast
- Next message: [Beowulf] force factory rest of sfs7000 (topspin 120)
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
On Thu, Jul 30, 2009 at 8:18 AM, Smith, Brian <brs at admin.usf.edu> wrote:
> Hi, All,
>
> I've been re-evaluating our existing InfiniBand fabric design for our HPC systems since I've been tasked with determining how we will add more systems in the future as more and more researchers opt to add capacity to our central system. We've already gotten to the point where we've used up all available ports on the 144 port SilverStorm 9120 chassis that we have and we need to expand capacity. One option that we've been floating around -- that I'm not particularly fond of, btw -- is to purchase a second chassis and link them together over 24 ports, two per spine. While a good deal of our workload would be ok with 5:1 blocking and 6 hops (3 across each chassis), I've determined that, for the money, we're definitely not getting the best solution.
>
> The plan that I've put together involves using the SilverStorm as the core in a spine-leaf design. We'll go ahead and purchase a batch of 24 port QDR switches, two for each rack, to connect our 156 existing nodes (with up to 50 additional on the way). Each leaf will have 6 links back to the spine for 3:1 blocking and 5 hops (2 for the leaves, 3 for the spine). This will allow us to scale the fabric out to 432 total nodes before having to purchase another spine switch. At that point, half of the six uplinks will go to the first spine, half to the second. In theory, it looks like we can scale this design -- with future plans to migrate to a 288 port chassis -- to quite a large number of nodes. Also, just to address this up front, we have a very generic workload, with a mix of md, abinitio, cfd, fem, blast, rf, etc.
>
> If the good folks on this list would be kind enough to give me your input regarding these options or possibly propose a third (or fourth) option, I'd very much appreciate it.
>
> Brian Smith

I think the hop count is a smaller design issue than cable length for QDR.
Cable length and the physical layout of hosts in the machine room may prove to be the critical issue in your design. Also, since routing is static, some seemingly obvious assumptions about routing, links, cross-sectional bandwidth and blocking may not hold in practice. Also less visible to a group like this are your storage, job mix and batch system.

For example, take a single rack with a pair of QDR 24 port switches. You might wish to have two or three links connecting those 24 port switches directly at QDR rates. Then the remaining three or four links would connect (DDR?) back to the 144 port switch. If the batch system were 'rack aware', jobs that could run on a single rack would do so, and jobs that had ranks scattered about would see a lightly loaded central switch.

Adding QDR to the mix as you scale out to 400+ nodes using newer multi core processor nodes could be fun.

When you knock on vendor doors ask about optical links... QDR optical links may let you reach beyond some classic fabric layouts as your machine room and cpu core count grow.

--
NiftyOMPI
T o m  M i t c h e l l
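
As a rough check of the port arithmetic in the quoted plan, here is a minimal Python sketch. The function names are made up for illustration, and it assumes every link runs at the same rate and that traffic spreads evenly over the uplinks. Neither is guaranteed with static routing, or if QDR leaves uplink to a slower core, so treat the ratios as port counts rather than delivered bandwidth.

```python
from fractions import Fraction


def leaf_spine(leaf_ports, uplinks, spine_ports):
    """Return (blocking ratio, max leaves, max nodes) for one spine.

    Hypothetical helper: counts ports only, ignores link speed and routing.
    """
    downlinks = leaf_ports - uplinks          # leaf ports left for hosts
    blocking = Fraction(downlinks, uplinks)   # host ports per uplink
    max_leaves = spine_ports // uplinks       # each leaf uses `uplinks` spine ports
    return blocking, max_leaves, max_leaves * downlinks


def chassis_pair_blocking(chassis_ports, interlinks):
    """Blocking ratio between two chassis joined by `interlinks` cables."""
    host_ports = chassis_ports - interlinks   # per-chassis ports left for hosts
    return Fraction(host_ports, interlinks)


# Leaf-spine plan: 24 port leaves, 6 uplinks each, one 144 port spine.
blocking, leaves, nodes = leaf_spine(24, 6, 144)
print(f"leaf-spine: {blocking.numerator}:{blocking.denominator} blocking, "
      f"{leaves} leaves, {nodes} nodes")

# Two 144 port chassis linked over 24 ports.
pair = chassis_pair_blocking(144, 24)
print(f"chassis pair: {pair.numerator}:{pair.denominator} blocking")
```

With the numbers from the quoted message this gives 3:1 blocking and 432 nodes for the leaf-spine plan and 5:1 for the linked chassis, matching the figures Brian states. The same helper can be rerun with a 288 port spine or a different uplink count when comparing options.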
