[Beowulf] Re: Why Do Clusters Suck?
David Mathog mathog at mendel.bio.caltech.edu
Tue Mar 22 14:12:05 PST 2005
- Previous message: [Beowulf] For -- mostly -- Jeff Layton...;-)
- Next message: [Beowulf] Re: Why Do Clusters Suck?
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> On Tue, 2005-03-22 at 12:42, David Mathog wrote:
> >
> > More to the point, while there is certainly a lot
> > of room for improvement, an awful lot of work is
> > getting done today using existing cluster technology
> > and it's far from clear to me that an advance
> > in cluster management software would result in much more
> > productivity. As opposed to, for instance, improving
> > network throughput, CPU power, or component reliability by
> > a factor of 10, any one of which would lead to an immediate
> > and dramatic productivity increase.
> >
>
> Would it?

Yes. Programs tend to either be CPU limited and/or bandwidth
limited. If you improve the relevant components the program will
speed up to the point that something else becomes the new
bottleneck. For most of our work now the CPU or memory bandwidth
is limiting but for some operations (data distribution) the
network bandwidth is.

> Myrinet or IB is more than enough bandwidth for
> us

Ok. Now imagine what would happen if you dropped back to
100baseT, which is what I'm still using.

> (weather and ocean modes, nearest neighbor communications), we prefer
> better latency.
> We have over a thousand nodes and hardware
> reliability has never significantly impacted our users and their
> productivity.

We've lost up to 2 of our 20 nodes at a time. Most of our
tasks depend upon particular data set slices being distributed
across the nodes. When one node goes down it takes several hours
to redistribute the data appropriately among the remaining
nodes. If I had 1000 nodes this would become enough of a problem
that I'd have to redo the data distribution method and build in
something resembling a RAID like redundancy.

>
> Our biggest problem is the immaturity of development
> tools.

I feel your pain on that one.

> It is all too common to hear
> developers tell me things like "does it work if you turn off bounds
> checking?".

Egads!
I'm a big fan of building and testing programs on as many
completely different platforms as possible, and with every
possible warning enabled. That does wonders for wringing
latent bugs out of code.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech
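
For illustration only, here is a minimal sketch of the "RAID like redundancy" idea mentioned above: each data slice is stored on more than one node, so a failed node does not force an immediate redistribution. Everything here is an assumption made for the example and not the scheme actually used at the facility: the function names `place_slices`, `surviving_sources`, and `lost_slices` are hypothetical, and the round-robin placement is just one simple way to spread the copies.

```python
def place_slices(num_slices, num_nodes, copies=2):
    """Assign each slice to `copies` distinct nodes (hypothetical round-robin scheme).

    Slice s goes to nodes s, s+1, ..., s+copies-1 (modulo num_nodes).
    """
    if not 1 <= copies <= num_nodes:
        raise ValueError("copies must be between 1 and num_nodes")
    return {
        s: [(s + r) % num_nodes for r in range(copies)]
        for s in range(num_slices)
    }


def surviving_sources(placement, failed):
    """For each slice, list the nodes that still hold a copy after failures."""
    failed = set(failed)
    return {
        s: [n for n in nodes if n not in failed]
        for s, nodes in placement.items()
    }


def lost_slices(placement, failed):
    """Return the slices with no surviving copy (these need a full reload)."""
    alive = surviving_sources(placement, failed)
    return sorted(s for s, nodes in alive.items() if not nodes)


if __name__ == "__main__":
    # 20 slices on 20 nodes, two copies each (numbers chosen for the example).
    placement = place_slices(20, 20, copies=2)
    print(lost_slices(placement, failed=[3]))      # one node down
    print(lost_slices(placement, failed=[3, 4]))   # two adjacent nodes down
```

With two copies per slice, any single node failure leaves every slice readable from its other copy. Two failures can still lose a slice if both of its copies sat on the failed nodes, which is why a larger cluster might want three copies or a parity-style scheme instead. The price is the extra disk space and the extra network traffic needed to write every slice more than once.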
