[Beowulf] Re NiftyOMPI Tom Mitchell <niftyompi@niftyegg.com>

Greg Keller Greg at keller.net
Wed Aug 5 08:41:23 PDT 2009


Brian,

A 3rd option: upgrade your chassis to 288 ports.  The beauty of SS/Qlogic
switches is that they all use the same components.  The chassis/backplane
is relatively dumb and cheap.  You can re-use your spine switches and leaf
switches.  You don't even need to add the additional spine switches if 2:1
blocking is OK.

Be very careful which ports you use to link the switches together if
you do try to splice 2 chassis together.  SMs (subnet managers) can have
trouble mapping many configurations, and you're probably best off
dedicating each line card as either "Uplink" or "Compute" (don't mix the
two), if I recall the layouts correctly.  With these "multi-tiered"
switches the SM apparently sometimes can't figure out which way is up if
you mix the ports.

A 4th Option:  36 Port QDR + DDR
Also note that the QDR switches are based on 36 port chips and not a  
huge price jump (per port), so with a "Hybrid" cable for the uplinks,  
you may be able to purchase the newer technology and block the heck  
out of it.  So adding 48 additional nodes could be as easy as:

Disconnect 48 nodes from the core switch to free those ports for uplinks
Connect 4 x 36 port QDR switches with 12 uplinks from each
Connect the 48 old and 48 new nodes to the 36 port QDR "edge" switches
This leaves you with 96 nodes on each side of the 48 uplinks
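
To sanity-check the port arithmetic in the steps above, here is a minimal Python sketch.  The helper name `edge_fabric` and its parameters are made up for illustration; it only counts ports and says nothing about routing, cable lengths, or QDR/DDR link rates.

```python
from fractions import Fraction


def edge_fabric(core_ports, edge_switches, ports_per_edge, uplinks_per_edge):
    """Count ports for a core switch with edge switches hanging off it.

    Hypothetical helper: every edge port that is not an uplink is
    assumed to hold one node, and every uplink uses one core port.
    """
    total_uplinks = edge_switches * uplinks_per_edge
    if total_uplinks > core_ports:
        raise ValueError("not enough core ports for the requested uplinks")
    nodes_per_edge = ports_per_edge - uplinks_per_edge
    return {
        "total_uplinks": total_uplinks,
        # Core ports not consumed by uplinks stay available for nodes.
        "core_nodes": core_ports - total_uplinks,
        "edge_nodes": edge_switches * nodes_per_edge,
        # Node ports per uplink on each edge switch (2 means 2:1).
        "edge_blocking": Fraction(nodes_per_edge, uplinks_per_edge),
    }


# Option 4: 144 port core, 4 x 36 port QDR edge switches, 12 uplinks each.
option4 = edge_fabric(core_ports=144, edge_switches=4,
                      ports_per_edge=36, uplinks_per_edge=12)
print(option4)

# Brian's quoted plan: 24 port leaves with 6 uplinks each, filling all
# 144 core ports with uplinks.
leaf_plan = edge_fabric(core_ports=144, edge_switches=24,
                        ports_per_edge=24, uplinks_per_edge=6)
print(leaf_plan)
```

Under those assumptions, option 4 works out to 48 uplinks, 96 nodes left on the core and 96 nodes on the edge at 2:1, and the same helper reproduces the 432 node, 3:1 figures from Brian's plan quoted below.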

Option 3 is the cleanest, and generally my favorite if you can get a
chassis for a reasonable price.

Cheers!
Greg



> Date: Mon, 3 Aug 2009 22:29:50 -0700
> From: NiftyOMPI Tom Mitchell <niftyompi at niftyegg.com>
> Subject: Re: [Beowulf] Fabric design consideration
> To: "Smith, Brian" <brs at admin.usf.edu>
> Cc: "beowulf at beowulf.org" <beowulf at beowulf.org>
> Message-ID:
> 	<88815dc10908032229n35dc509clba0b1a52ab6af8f1 at mail.gmail.com>
> Content-Type: text/plain; charset=UTF-8
>
> On Thu, Jul 30, 2009 at 8:18 AM, Smith, Brian<brs at admin.usf.edu>  
> wrote:
>> Hi, All,
>>
>> I've been re-evaluating our existing InfiniBand fabric design for  
>> our HPC systems since I've been tasked with determining how we will  
>> add more systems in the future as more and more researchers opt to  
>> add capacity to our central system.  We've already gotten to the  
>> point where we've used up all available ports on the 144 port  
>> SilverStorm 9120 chassis that we have and we need to expand  
>> capacity.  One option that we've been floating around -- that I'm  
>> not particularly fond of, btw -- is to purchase a second chassis  
>> and link them together over 24 ports, two per spine.  While a good  
>> deal of our workload would be ok with 5:1 blocking and 6 hops (3  
>> across each chassis), I've determined that, for the money, we're  
>> definitely not getting the best solution.
>>
>> The plan that I've put together involves using the SilverStorm as  
>> the core in a spine-leaf design.  We'll go ahead and purchase a  
>> batch of 24 port QDR switches, two for each rack, to connect our  
>> 156 existing nodes (with up to 50 additional on the way).  Each  
>> leaf will have 6 links back to the spine for 3:1 blocking and 5  
>> hops (2 for the leafs, 3 for the spine).  This will allow us to  
>> scale the fabric out to 432 total nodes before having to purchase  
>> another spine switch.  At that point, half of the six uplinks will  
>> go to the first spine, half to the second.  In theory, it looks  
>> like we can scale this design -- with future plans to migrate to a  
>> 288 port chassis -- to quite a large number of nodes.  Also, just  
>> to address this up front, we have a very generic workload, with a  
>> mix of md, abinitio, cfd, fem, blast, rf, etc.
>>
>> If the good folks on this list would be kind enough to give me your  
>> input regarding these options or possibly propose a third (or  
>> fourth) option, I'd very much appreciate it.
>>
>> Brian Smith
>
> I think the hop count is a smaller design issue than cable length for
> QDR.  Cable length and the physical layout of hosts in the machine room
> may prove to be the critical issue in your design.  Also, since routing
> is static, some seemingly obvious assumptions about routing, links,
> cross-sectional bandwidth and blocking can turn out to be non-obvious.
>
> Also less obvious to a group like this is your storage, job mix and
> batch system.
>
> For example, take a single rack with a pair of QDR 24 port switches.
> You might wish to have two or three links connecting those 24 port
> switches directly at QDR rates.  Then the remaining three or four links
> would connect (DDR?) back to the 144 port switch.  If the batch system
> was 'rack aware', jobs that could run on a single rack would, and jobs
> that had ranks scattered about would see a lightly loaded central switch.
>
> Adding QDR to the mix as you scale out to 400+ nodes using newer
> multi-core processor nodes could be fun.
>
> When you knock on vendor doors ask about optical links...  QDR optical
> links may let you reach beyond some classic fabric layouts as your
> machine room and CPU core count grow.
>
> -- 
>        NiftyOMPI
>        T o m   M i t c h e l l
>
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 4 Aug 2009 14:48:21 -0400
> From: Brock Palen <brockp at umich.edu>
> Subject: [Beowulf] force factory rest of sfs7000 (topspin 120)
> To: Bewoulf <beowulf at beowulf.org>
> Message-ID: <635DE2F6-3A2C-4A58-91F1-072288667650 at umich.edu>
> Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes
>
> We have a cisco sfs7000 (maybe still under support waiting on cisco)
> also known as a topspin 120, IB switch.
>
> We cannot login with the password we (thought) had it set to. I have
> looked online and find little tonight about forcing the switch back to
> factory defaults without a login.
>
> Serial console works fine, just can't login.  We can screw in firmware
> a little by stopping boot, just don't know what to do from there.  If
> anyone has directions how to force sfs7000 to factory defaults, or
> password recovery help would be great.
>
> Brock Palen
> www.umich.edu/~brockp
> Center for Advanced Computing
> brockp at umich.edu
> (734)936-1985
>
>
>
>
>



