[Beowulf] recommendations for a good ethernet switch for connecting ~300 compute nodes
Gus Correa gus at ldeo.columbia.edu
Thu Sep 3 10:25:01 PDT 2009
Rahul Nabar wrote:
> On Thu, Sep 3, 2009 at 10:19 AM, Gus Correa<gus at ldeo.columbia.edu> wrote:
>> See these small SDR switches:
>>
>> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idcategory=7&idproduct=13
>> http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=10
>>
>> And SDR HCA card:
>>
>
> Thanks Gus! This info was very useful. A 24-port switch is $2400 and
> the card $125. Thus each compute node would be approximately $300 more
> expensive. (How about infiniband cables? Are those special and how
> expensive. I did google but was overwhelmed by the variety available.)
>

Hi Rahul

IB cables (0.5-8m, $40-$109):

http://www.colfaxdirect.com/store/pc/viewCategories.asp?pageStyle=m&idCategory=2
http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=1&idcategory=2
http://www.colfaxdirect.com/store/pc/viewPrd.asp?idproduct=2&idcategory=2
etc ...

> This isn't bad at all I think. If I base it on my current node price
> it would require only about a 20% performance boost to justify this
> investment. I feel Infy could deliver that. When I had calculated it
> the economics was totally off; maybe I had wrong figures.
>
> The price-scaling seems tough though. Stacking 24 port switches might
> get a bit too cumbersome for 300 servers.

It probably will.
I will defer any comments to the network pros on the list.

Here is a suggestion.
I would guess that if you don't intend to run the codes on more than,
say, 24-36 nodes at once, you might as well not stack all the small
IB switches.
I.e., you could divide the cluster IB-wise into smaller units,
of perhaps 36 nodes or so, with 2-3 switches serving each unit.
Not sure how to handle the IB subnet manager(s) in such a configuration,
but there may be ways around it.

This scheme may take some scheduler configuration to handle
MPI job submission, but it may save you money and
hardware/cabling complexity,
and still let you run MPI programs with a substantial number of processes.
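The back-of-the-envelope numbers above can be sketched in a few lines of Python. This is only a rough illustration: the prices are the ones quoted in this thread, while the 300-node total, the 36-node unit size, and the assumption that every switch port feeds a node (none spent on switch-to-switch links) are simplifications made for the example. The function names are hypothetical.

```python
# Figures quoted in the thread (2009 list prices).
SWITCH_PRICE = 2400.0               # 24-port SDR switch
SWITCH_PORTS = 24
HCA_PRICE = 125.0                   # SDR HCA card
CABLE_PRICE_RANGE = (40.0, 109.0)   # IB cables, 0.5-8 m

# Assumptions for illustration only:
# every switch port feeds a node (no ports used for inter-switch links).
TOTAL_NODES = 300
ISLAND_SIZE = 36                    # upper end of the "24-36 nodes" guess


def per_node_ib_cost(cable_price):
    """Share of one switch port + one HCA + one cable, per compute node."""
    return SWITCH_PRICE / SWITCH_PORTS + HCA_PRICE + cable_price


def island_sizes(total_nodes, island_size):
    """Split the cluster into IB units; the last one may be smaller."""
    full, rest = divmod(total_nodes, island_size)
    return [island_size] * full + ([rest] if rest else [])


low, high = (per_node_ib_cost(c) for c in CABLE_PRICE_RANGE)
islands = island_sizes(TOTAL_NODES, ISLAND_SIZE)

print(f"IB cost per node: ${low:.0f} to ${high:.0f}")
print(f"{len(islands)} IB units, sizes: {islands}")
```

With these inputs the per-node figure brackets the "approximately $300" estimate, and 300 nodes split into eight full units of 36 plus one smaller unit. Ports reserved for links between the 2-3 switches of each unit would raise the per-node switch share somewhat.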
You can still fully connect the 300 nodes through Gbit Ether,
for admin and I/O purposes, stacking 48-port GigE switches.
IB is a separate (set of) network(s),
which I assume will be dedicated to MPI only.

You may want to check the 36-port IB switches also,
but IIRR they are only DDR and QDR, not SDR, and somewhat more expensive.

> But when I look at
> corresponding 48 or 96 port switches the per-port-price seems to shoot
> up. Is that typical?
>

I was told the current IB switch price threshold is 36-port.
Above that it gets too expensive; the cost-effective solution is
stacking smaller switches.
I'm just passing the information/gossip along.

>> For a 300-node cluster you need to consider
>> optical fiber for the IB uplinks,
>
> You mean compute-node-to-switch and switch-to-switch connections?
> Again, any $$$ figures, ballpark?
>

I would guess you may need optical fiber for switch-switch connections,
depending on the distance, of course (say, across two racks),
if this type of connection is needed.
Regular IB cables are probably able to handle the node-switch links,
if the switches are distributed across the racks.

>> I don't know about your computational chemistry codes,
>> but for climate/oceans/atmosphere (and probably for CFD)
>> IB makes a real difference w.r.t. Gbit Ethernet.
>
> I have a hunch (just a hunch) that the computational chemistry codes
> we use haven't been optimized to get the full advantage of the latency
> benefits etc. Some of the stuff they do is pretty bizarre and
> inefficient if you look at their source codes (writing to large I/O
> files all the time, e.g.). I know this ought to be fixed but that
> seems a problem for another day!
>

Not only your Chem codes.
Brute force I/O is rampant here also.
Some codes take pains to improve MPI communication on the
domain decomposition side, with asynchronous communication, etc,
then squander it all by letting everybody do I/O in unison.
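One way to avoid everybody doing I/O in unison is to let only a few processes write at a time. Below is a minimal sketch, in plain Python, of just the bookkeeping for that idea. The names `io_waves`, `wave_of`, and `max_writers` are hypothetical and do not come from any code discussed in this thread. In an MPI program each wave would take its turn writing, with some synchronization between waves; that part is not shown here.

```python
def io_waves(n_ranks, max_writers):
    """Group ranks 0..n_ranks-1 into waves of at most max_writers ranks."""
    if max_writers < 1:
        raise ValueError("max_writers must be at least 1")
    return [list(range(start, min(start + max_writers, n_ranks)))
            for start in range(0, n_ranks, max_writers)]


def wave_of(rank, max_writers):
    """Index of the wave in which a given rank gets its turn to write."""
    return rank // max_writers


# Example: 10 processes, at most 4 writing at once.
waves = io_waves(n_ranks=10, max_writers=4)
for i, ranks in enumerate(waves):
    print(f"wave {i}: ranks {ranks}")
```

How many simultaneous writers a file system tolerates is site-specific, so `max_writers` is something to tune rather than a recommended value.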
(Hence, keep in mind Joshua's posting about educating users
and adjusting codes to do I/O gently.)

I hope this helps.
Gus Correa
---------------------------------------------------------------------
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
---------------------------------------------------------------------
