[Beowulf] 512 nodes Myrinet cluster Challenges
kewley at gps.caltech.edu
Mon May 1 13:18:23 PDT 2006
> On Fri, 28 Apr 2006, David Kewley wrote:
> > By the way, the idea of rolling-your-own hardware on a large cluster,
> > and planning on having a small technical team, makes me shiver in
> > horror. If you go that route, you'd better have *lots* of experience
> > in clusters, and make very good decisions about cluster components
> > and management methods. If you don't, your users will suffer
> > mightily, which means you will suffer mightily too.
On Friday 28 April 2006 16:36, Robert G. Brown wrote:
> I >>have<< lots of experience in clusters and have tried rolling my own
> nodes for a variety of small and medium-sized clusters. Let me clarify.
> For clusters with more than perhaps 16 nodes, or EVEN 32 if you're
> feeling masochistic and inclined to heartache: don't.
> Or you will have a really high probability of being very, very sorry.
On Sunday 30 April 2006 09:42, Mark Hahn wrote:
> I believe that overstates the case significantly.
> Some clusters are just plain easy. It's entirely possible to buy a
> significant number of conservative compute nodes, toss them onto a
> generic switch or two, and run the whole thing for a couple of years
> without any real effort. I did it, and while I have a lot of experience,
> I didn't apply any deep voodoo for the cluster I'm thinking of. It
> started out with a good solid login/file/boot server (4U, 6x SCSI,
> dual-Xeon 2.4, 1 GB RAM), a single 48-port 100BT (1G uplink) switch,
> and 48 dual-Xeon nodes (diskful but not disk-booting). It was a delight
> to install, maintain, and manage. I originally built it with APC
> controllable PDUs, but in the process of moving it, stripped them out as
> I didn't need them. (I _do_ always require net-IPMI on anything newly
> purchased.) I've added more
> nodes to the cluster since then - dual-Opteron nodes and a couple of GE
> switches.
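Mark's insistence on net-IPMI is what lets a small team drop the managed PDUs: the node's BMC gives power control and console access over the LAN. A minimal sketch using ipmitool's common subcommands (the address and credentials below are placeholders, not from the thread):

```shell
# Hypothetical BMC address and credentials -- substitute your own.
BMC=10.0.0.101
USER=admin
PASS=secret

# Query and control node power over the network (no PDU needed).
ipmitool -I lanplus -H "$BMC" -U "$USER" -P "$PASS" chassis power status
ipmitool -I lanplus -H "$BMC" -U "$USER" -P "$PASS" chassis power cycle

# Serial-over-LAN console, for watching a node that fell off the network.
ipmitool -I lanplus -H "$BMC" -U "$USER" -P "$PASS" sol activate

# Hardware health: temperatures, fans, voltages.
ipmitool -I lanplus -H "$BMC" -U "$USER" -P "$PASS" sdr list
```

These commands assume a BMC that speaks IPMI 2.0 (the lanplus interface); older boards may only support `-I lan`.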
> > For clusters with more than perhaps 16 nodes, or EVEN 32 if you're
> > feeling masochistic and inclined to heartache:
> With all respect to rgb, I don't think size is a primary factor in
> cluster building/maintaining/etc. effort. Certainly it does eventually
> become a concern, but that's primarily a statistical result of
> MTBF/nnodes. It's quite possible to choose hardware to maximize MTBF and
> minimize configuration risk.
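Mark's MTBF/nnodes point can be made concrete: if node failures are independent, the expected time between failures anywhere in the cluster shrinks linearly with node count. A quick sketch (the MTBF figure is illustrative, not from the thread):

```python
def cluster_mtbf(node_mtbf_hours: float, n_nodes: int) -> float:
    """Expected hours between failures anywhere in the cluster,
    assuming independent, identically distributed node failures."""
    return node_mtbf_hours / n_nodes

# A 50,000-hour per-node MTBF is comfortable for a single machine...
print(cluster_mtbf(50_000, 1))    # 50000.0 hours, roughly 5.7 years
# ...but at the 512-node scale of this thread it means a failure
# somewhere in the cluster about every four days.
print(cluster_mtbf(50_000, 512))  # 97.65625 hours
```

This is why size reads as a statistical concern rather than a qualitative one: doubling the node count doubles the cluster-wide failure rate, so hardware chosen for high MTBF buys headroom proportionally.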
Ah, so my opinion is midway between Mark's & RGB's. A very nice place to be.