[Beowulf] Why Do Clusters Suck?
Donald Becker becker at scyld.com
Tue Mar 22 16:18:31 PST 2005
On Wed, 23 Mar 2005, Stuart Midgley wrote:

> On 22/03/2005, at 7:36, Douglas Eadline - ClusterWorld Magazine wrote:
> > So why do clusters suck?
>
> From my position, this issue is really complex. In the Australian
> scene, the main reason "clusters suck" has nothing to do with distros,
> hardware or associated software. It is more an issue with support
> staff...
...
> Clusters, by their nature and design, are not simple beasts.

Just like "Math is hard", "Computers are hard". But there are many things that can be done to make clusters barely more difficult to use than single computers.

If you are already used to running a cluster, you may not realize all of the extra complexity that you have introduced. This is especially true when you write ad hoc programs and scripts. When they work, everything looks fine. But do they work in any other environment, or after anything changes? And what happens when they break?

> When everything is running well, you can manage them with almost no staff.
> However, when something goes wrong the diagnostic/resolution cycle can
> be long and very complex.

Yup. It's not how easy it looks when things go right, but how complex the system is when things go wrong. That's a corollary to "an abstraction layer is worse than useless when you have to look underneath". It's important to have diagnosable, documented tools and a system that is as simple as possible.

> How to make clusters less sucky? Well, for a large cluster
> users/system administrators, decent training would be a good start.
> Training which takes people through the process of building,
> installing, breaking and fixing a cluster. Of course, then there is
> the MPI/application side of things which would be another course. Try
> to wrap 10years worth of system/computational experience up into a 5
> days course ;)

I'm the instructor for many of our introductory training courses. That is one of my motivations to make our system as simple as possible.
Sometimes it's faster to write the code to avoid an exception to a general rule than to figure out how to explain it.

A good example is handling heterogeneous hardware. I don't mean mixing Alphas with Itaniums with Opterons. I mean the gritty, everyday kind of minor system differences. Similar-looking systems with different Ethernet adapters. A mix of diskless and disk-based systems. Different versions of PXE. Toss in a few dual-processor machines with one CPU removed, a mix of memory sizes, and that flaky disk that you can't quite admit is broken.

Each of these differences can potentially be handled automatically. If you do a full install, the installer might handle the differences and you might not even notice them... until you consider long-term administration. What happens when you do an update? How do you recover when a system disk goes bad?

A cluster need not be a collection of workstation environments, and treating it like one adds more complexity than someone with a lot of experience might initially perceive.
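
To make the "minor system differences" concrete, here is a minimal sketch of how one might spot them automatically. It assumes a hypothetical per-node inventory; the node names, attribute fields, and the `baseline` and `find_deviations` helpers are all made up for illustration and do not come from any particular cluster toolkit.

```python
from collections import Counter

# Hypothetical inventory: one dict of hardware attributes per node.
# All names and values here are illustrative assumptions.
NODES = {
    "n0": {"nic": "e1000", "disk": True, "cpus": 2, "mem_mb": 2048},
    "n1": {"nic": "e1000", "disk": True, "cpus": 2, "mem_mb": 2048},
    "n2": {"nic": "tg3", "disk": False, "cpus": 2, "mem_mb": 2048},
    "n3": {"nic": "e1000", "disk": True, "cpus": 1, "mem_mb": 1024},
}


def baseline(nodes):
    """Return the most common value of each attribute across all nodes."""
    keys = sorted({key for attrs in nodes.values() for key in attrs})
    result = {}
    for key in keys:
        counts = Counter(attrs.get(key) for attrs in nodes.values())
        result[key] = counts.most_common(1)[0][0]
    return result


def find_deviations(nodes):
    """Map each odd node to the attributes that differ from the baseline."""
    base = baseline(nodes)
    deviations = {}
    for name, attrs in sorted(nodes.items()):
        diff = {k: attrs.get(k) for k in base if attrs.get(k) != base[k]}
        if diff:
            deviations[name] = diff
    return deviations


if __name__ == "__main__":
    for name, diff in find_deviations(NODES).items():
        print(name, diff)
```

A report like this only makes the differences visible; deciding what an update or a disk replacement should do for each odd node is still the hard part.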
