[Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes

Gus Correa gus at ldeo.columbia.edu
Thu Sep 3 10:25:01 PDT 2009


Rahul Nabar wrote:
> On Thu, Sep 3, 2009 at 10:19 AM, Gus Correa<gus at ldeo.columbia.edu> wrote:
>> See these small SDR switches:
>>
>> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=13
>> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10
>>
>> And SDR HCA card:
>>
> 
> Thanks Gus! This info was very useful. A 24port switch is $2400 and
> the card $125. Thus each compute node would be approximately $300 more
> expensive. (How about infiniband cables? Are those special and how
> expensive. I did google but was overwhelmed by the variety available.)
> 

Hi Rahul

IB cables (0.5-8m,$40-$109):

http://www.colfaxdirect.com/store/pc/viewCategories.asp?pageStyle=m&idCategory=2
http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=1&idcategory=2
http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=2&idcategory=2
etc ...
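
As a rough check on the "approximately $300 per node" figure, here is a minimal sketch using only the prices quoted in this thread ($2400 for a 24-port switch, $125 per HCA, $40-$109 per cable). The function name is made up for illustration, and it assumes every switch port goes to a compute node, with no ports set aside for uplinks, so the real number would be somewhat higher.

```python
def ib_cost_per_node(switch_price, switch_ports, hca_price, cable_price):
    """Per-node IB cost: a share of the switch, plus one HCA and one cable.

    Assumption: all switch ports connect compute nodes (no uplink ports).
    """
    return switch_price / switch_ports + hca_price + cable_price


# Prices quoted in this thread; the cable price range is $40-$109.
low = ib_cost_per_node(2400, 24, 125, 40)
high = ib_cost_per_node(2400, 24, 125, 109)
print(f"${low:.0f} to ${high:.0f} per node")
```

With these inputs the estimate lands on both sides of $300, depending mostly on cable length.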


> This isn't bad at all I think. If I base it on my current node price
> it would require only about a 20% performance boost to justify this
> investment. I feel Infy could deliver that. When I had calculated it
> the economics was totally off; maybe I had wrong figures.
> 
> The price-scaling seems tough though. Stacking 24 port switches might
> get a bit too cumbersome for 300 servers. 

It probably will.
I will defer any comments to the network pros on the list.

Here is a suggestion.
I would guess that if you don't intend to run the codes,
say, on more than 24-36 nodes at once, you might as well not stack all 
the small IB switches.
I.e., you could divide the cluster
IB-wise into smaller units, of perhaps 36 nodes or so, with 2-3
switches serving each unit.
Not sure how to handle the IB subnet(s) manager in such a configuration,
but there may be ways around.
This scheme may take some scheduler configuration to
handle MPI job submission,
but it may save you money and hardware/cabling complexity,
and still let you run MPI programs with a substantial
number of processes.
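
To make the sizing concrete, here is a minimal sketch of the split described above: 300 nodes cut into IB units of at most 36 nodes, each served by 24-port switches. The function names and the number of ports reserved per switch for inter-switch links (6) are assumptions chosen for illustration. It only counts ports; it says nothing about topology, blocking factor, or subnet managers.

```python
import math


def partition_nodes(total_nodes, unit_size):
    """Split the cluster into IB units of at most unit_size nodes."""
    units = [unit_size] * (total_nodes // unit_size)
    remainder = total_nodes % unit_size
    if remainder:
        units.append(remainder)
    return units


def switches_for_unit(nodes, ports_per_switch=24, reserved_ports=6):
    """Port-count estimate of the switches needed for one unit.

    Assumption: a unit that fits on a single switch needs no inter-switch
    links; otherwise reserved_ports on each switch are kept for them.
    """
    if nodes <= ports_per_switch:
        return 1
    return math.ceil(nodes / (ports_per_switch - reserved_ports))


units = partition_nodes(300, 36)
total_switches = sum(switches_for_unit(n) for n in units)
print(len(units), "units:", units)
print(total_switches, "24-port switches in total")
```

Reserving more ports for inter-switch links (for example 12 instead of 6) raises the count to 3 switches per 36-node unit, which is the upper end of the 2-3 range mentioned above.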

You can still fully connect the 300 nodes through Gbit Ether, for admin
and I/O purposes, stacking 48-port GigE switches.
IB is a separate (set of) network(s),
which I assume will be dedicated to MPI only.

You may want to check the 36-port IB switches also, but IIRR they are
only DDR and QDR, not SDR, and somewhat more expensive.


> But when I look at
> corresponding 48 or 96 port switches the per-port-price seems to shoot
> up. Is that typical?
> 

I was told the current IB switch price threshold is 36-port.
Above that it gets too expensive, and the cost-effective
solution is stacking smaller switches.
I'm just passing the information/gossip along.

>> For a 300-node cluster you need to consider
>> optical fiber for the IB uplinks,
> 
> You mean compute-node-to-switch and switch-to-switch connections?
> Again, any $$$ figures, ballpark?
> 

I would guess you may need optical fiber for switch-switch connections.
Depending on the distance, of course,
say, across two racks, if this type of connection is needed.
Regular IB cables are probably able to handle the node-switch links,
if the switches are distributed across the racks.

>> I don't know about your computational chemistry codes,
>> but for climate/oceans/atmosphere (and probably for CFD)
>> IB makes a real difference w.r.t. Gbit Ethernet.
> 
> I have a hunch (just a hunch) that the computational chemistry codes
> we use haven't been optimized to get the full advantage of the latency
> benefits etc. Some of the stuff they do is pretty bizarre and
> inefficient if you look at their source codes (writing to large I/O
> files all the time, e.g.). I know this ought to be fixed but that
> seems a problem for another day!
> 

Not only your Chem codes.
Brute force I/O is rampant here also.
Some codes take pains to improve MPI communication on the domain 
decomposition side, with asynchronous communication, etc,
then squander it all by letting everybody do I/O in unison.
(Hence, keep in mind Joshua's posting about educating users and
adjusting codes to do I/O gently.)

I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------


