[Beowulf] Why Do Clusters Suck?
stuart.midgley at anu.edu.au
Tue Mar 22 15:34:08 PST 2005
On 22/03/2005, at 7:36, Douglas Eadline - ClusterWorld Magazine wrote:
> So why do clusters suck?
From my position, this issue is really complex. In the Australian
scene, the main reason "clusters suck" has nothing to do with distros,
hardware or associated software. It is more an issue with support
staff. It is easy to buy hardware, software and download a distro.
However, it is very difficult to get good support staff.
Clusters, by their nature and design, are not simple beasts. When
everything is running well, you can manage them with almost no staff.
However, when something goes wrong the diagnostic/resolution cycle can
be long and very complex.
An error in an MPI program could lie in the actual user code, the MPI
layer, the system software, the interconnect, a hardware failure, or
any combination of these. Getting good staff to understand and
handle all these layers is difficult. Spending $100k will get you a
reasonably sized cluster on the floor within a few weeks, which will
last, say, 3 years. Yet in the staff space, $100k doesn't even get a
good system administrator for a single year. And a system
administrator is not always what is required. They may not have a good
understanding of MPI/applications etc.
How to make clusters less sucky? Well, for the users and system
administrators of a large cluster, decent training would be a good start.
Training which takes people through the process of building,
installing, breaking and fixing a cluster. Of course, then there is
the MPI/application side of things, which would be another course. Try
to wrap 10 years' worth of system/computational experience up into a
5-day course ;)
Dr Stuart Midgley | stuart.midgley at anu.edu.au
Supercomputer Facility | smidgley at netspace.net.au
Leonard Huxley Building 56 | +61 (0)2 6125 5988 Work
Australian National University | +61 (0)2 6125 8199 Fax
CANBERRA ACT 0200 | +61 (0)4 1125 2488 Mob