[Beowulf] Re: Why Do Clusters Suck?
mathog at mendel.bio.caltech.edu
Tue Mar 22 14:12:05 PST 2005
> On Tue, 2005-03-22 at 12:42, David Mathog wrote:
> > More to the point, while there is certainly a lot
> > of room for improvement, an awful lot of work is
> > getting done today using existing cluster technology
> > and it's far from clear to me that an advance
> > in cluster management software would result in much more
> > productivity. As opposed to, for instance, improving
> > network throughput, CPU power, or component reliability by
> > a factor of 10, any one of which would lead to an immediate
> > and dramatic productivity increase.
> Would it?
Yes. Programs tend to be CPU limited, bandwidth limited, or both.
If you improve the relevant component, the program will speed up
until something else becomes the new bottleneck. For most of our
work now the CPU or memory bandwidth is limiting, but for some
operations (data distribution) the network bandwidth is.
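To make the bottleneck argument concrete, here is a rough sketch of
the reasoning, not a measurement. It assumes a phase takes as long as
its slowest resource needs, and every number in it (the demand and
capacity figures, the resource names) is made up for illustration.

```python
def phase_time(demand, capacity):
    """Return (seconds, limiting resource) for one phase of a job.

    Assumes the phase runs as long as its slowest resource needs,
    which is a simplification of real overlap behavior.
    """
    times = {r: demand[r] / capacity[r] for r in demand}
    limiting = max(times, key=times.get)
    return times[limiting], limiting

# Hypothetical node: 1e9 operations/s, and 100 Mbit/s, about 12.5e6 bytes/s.
capacity = {"cpu": 1e9, "network": 12.5e6}
# Hypothetical job: 3.6e12 operations and 6.25e9 bytes moved.
demand = {"cpu": 3.6e12, "network": 6.25e9}

before, limit_before = phase_time(demand, capacity)

# Make the CPU ten times faster and leave the network alone.
faster = dict(capacity, cpu=capacity["cpu"] * 10)
after, limit_after = phase_time(demand, faster)

print(limit_before, before)   # limited by the CPU
print(limit_after, after)     # now limited by the network
print("speedup:", before / after)
```

With these invented numbers the job is CPU limited at first, and a
tenfold faster CPU gives less than a tenfold speedup because the
network becomes the new limit.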
> Myrinet or IB is more than enough bandwidth for
Ok. Now imagine what would happen if you dropped back to 100baseT,
which is what I'm still using.
> (weather and ocean models, nearest neighbor communications), we prefer
> better latency.
> We have over a thousand nodes and hardware
> reliability has never significantly impacted our users and their
We've lost up to 2 of our 20 nodes at a time. Most of our
tasks depend upon particular data set slices being distributed
across the nodes. When one node goes down, it takes several
hours to redistribute the data appropriately among the
remaining nodes. If I had 1000 nodes, this would become
enough of a problem that I'd have to redo the data distribution
method and build in something resembling RAID-like redundancy.
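One way such redundancy might look, purely as a sketch: keep several
copies of each slice on different nodes, so a failed node's slices can
still be read elsewhere. This is not what we run today. The round-robin
placement, the function names, and the choice of 3 copies are all
assumptions for illustration.

```python
from itertools import combinations

def place_slices(num_slices, num_nodes, copies):
    """Map each slice index to the nodes holding a copy (round-robin)."""
    if copies > num_nodes:
        raise ValueError("cannot keep more copies than there are nodes")
    return {s: [(s + j) % num_nodes for j in range(copies)]
            for s in range(num_slices)}

def lost_slices(placement, failed):
    """Return the slices that have no copy left on a surviving node."""
    failed = set(failed)
    return [s for s, nodes in placement.items() if failed.issuperset(nodes)]

# 20 slices on 20 nodes, 3 copies each (hypothetical layout).
placement = place_slices(num_slices=20, num_nodes=20, copies=3)

# Check every possible pair of simultaneously failed nodes.
worst = max(len(lost_slices(placement, pair))
            for pair in combinations(range(20), 2))
print("slices lost in the worst two-node failure:", worst)
```

The price is disk space: three copies means three times the storage,
and a replacement node still has to be refilled, though the jobs can
keep running from the surviving copies in the meantime.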
> Our biggest problem is the immaturity of development
I feel your pain on that one.
> It is all too common to hear
> developers tell me things like "does it work if you turn off bounds
Egads! I'm a big fan of building and testing programs on as many
completely different platforms as possible, and with every
possible warning enabled. That does wonders for wringing latent
bugs out of code.
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech