How do you keep clusters running....

Fri Apr 5 09:59:56 PST 2002

On Thu, Apr 04, 2002 at 10:12:54AM -0300, Leandro Tavares Carneiro's all...
> We have a Beowulf cluster here with 64 production nodes and 128
> processors, and like you we have some problems with fans.
> Our cluster hardware is very cheap, using motherboards and cases
> easily found in the local market, and the problems are critical.
> We have 5 spare nodes, and only 3 of those are ready to work. All our

[..]

> I think this kind of problem is inevitable with cheap PC parts, and can
> be reduced with high-quality (and higher-priced) parts. We are doing a study to
> buy a new cluster for another application, and we have called Compaq and IBM to
> see what they have in hardware and software, in the hope of a future
> with fewer problems...

You can always employ the 'maximum tolerable failure rate' concept and buy for
that rate. In pricing equipment, I find there is a definite non-linear
(exponential?) relationship between MTBF and price. For a failure rate
that's 3-5 times higher you can spend up to 40% less (or better) on equipment.
This isn't a solid number, but it feels within the ballpark to me based on what
I've priced out before on clusters. Others may dispute this, but I am talking
about buying pre-assembled Dell 2U rackmount servers vs a bunch of boards,
CPUs and RAM you slap together yourself.

Using this concept, and setting your maximum tolerable failure rate at a
specific level that suits your needs, e.g. 1 node per month, coupled with an
aggressive RMA schedule with a good vendor, you can get the best
price/performance out of a cluster.  If you can withstand, using my example, a
3-5 times higher failure rate, which ends up being 1 node per month, you end up
with 40% more gear for the same money.

If you require 100% of all nodes to be present in one mesh involved in
parallel calculations, and a single node failure is catastrophic to the entire
job (everything run since startup is lost), then it's obviously not worth it if
your jobs have a runtime similar to the time between failures (1 month). A
failure rate of 1 node/5 months would work far better in that case, as the
average failure would lose you only about half a job, i.e. 10% of the work you
do in 5 months, whereas with 40% more equipment and 5x the failure rate you may
lose most of your work.  (Note I am not considering that your jobs may instead
run in [1 month / 1.4] due to the speedup from more gear, which would have jobs
finish in ~70% of the time (~3 weeks) and therefore give them a higher chance
of finishing in the 1 node/mo failure-rate environment.)
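
A rough way to put numbers on the all-nodes-in-one-mesh case is sketched below
in Python. It assumes failures arrive as a Poisson process (constant rate,
independent failures), which is a simplifying assumption and not something
measured here; the function name is hypothetical, and the 1.4x speedup assumes
the job scales perfectly with the extra gear.

```python
import math


def p_job_survives(job_months: float, failures_per_month: float) -> float:
    """Probability that no node fails during a job.

    Assumes failures are a Poisson process with a constant cluster-wide
    rate (an assumption, not a measured property of any real cluster).
    """
    return math.exp(-failures_per_month * job_months)


# Cheap gear: 1 failure/month, job needs the whole month.
cheap = p_job_survives(1.0, 1.0)
# HA gear: 1 failure per 5 months, same 1-month job.
ha = p_job_survives(1.0, 1.0 / 5.0)
# Cheap gear with 40% more nodes, assuming a perfect 1.4x speedup.
cheap_faster = p_job_survives(1.0 / 1.4, 1.0)

print(f"cheap gear, 1-month job:     {cheap:.2f}")
print(f"HA gear, 1-month job:        {ha:.2f}")
print(f"cheap gear, job in ~3 weeks: {cheap_faster:.2f}")
```

Under these assumptions the month-long job finishes untouched about 37% of the
time on the cheap gear, about 82% of the time on the HA gear, and about 49% of
the time on the cheap gear once the speedup is counted.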

However, if your jobs run on all nodes for only a day, then a failure of a
single node once per month costs you, on average, half a day of lost work per
month. For this concession you get 40% more equipment (possibly meaning 40%
more processing power, depending on your application).
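
The half-day figure and what is left of the 40% gain can be checked with a few
lines of Python. This is only a sketch: the 30-day month, the linear scaling of
throughput with gear, and the function names are assumptions for illustration,
and the HA rate of 0.2 failures/month is the 1 node/5 months example from above.

```python
def lost_days_per_month(failures_per_month: float, job_days: float) -> float:
    """Average work lost per month if every failure kills one job that was,
    on average, half finished (a failure is equally likely at any point)."""
    return failures_per_month * job_days / 2.0


def net_capacity(relative_gear: float, lost_days: float,
                 month_days: float = 30.0) -> float:
    """Capacity relative to a failure-free HA cluster (= 1.0), assuming
    throughput scales linearly with the amount of gear."""
    return relative_gear * (month_days - lost_days) / month_days


cheap_loss = lost_days_per_month(1.0, 1.0)  # 1 failure/month, 1-day jobs
ha_loss = lost_days_per_month(0.2, 1.0)     # 1 failure per 5 months

cheap_net = net_capacity(1.4, cheap_loss)   # 40% more gear
ha_net = net_capacity(1.0, ha_loss)

print(f"cheap: lose {cheap_loss:.2f} days/month, net capacity {cheap_net:.2f}")
print(f"HA:    lose {ha_loss:.2f} days/month, net capacity {ha_net:.2f}")
```

With day-long jobs the cheap gear still comes out at roughly 1.38x the HA
cluster, because half a day out of thirty barely dents the extra capacity.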

You also need to factor in how much personal time you have to deal with RMAing
and swapping equipment. This may well make any efforts towards this kind of
model impossible if extra time is not available. That notwithstanding, the
cost of extra time can be easily factored into the equation (and knowledgeable
work-study undergrads can be a REALLY cheap alternative here :)

Of course with 40% more power, you may configure two sub-clusters, each with
70% of the power of the original HA design (HA = high availability ~ higher
price).  If this fits your needs, a failure of a single node once per month,
with jobs averaging a day in length, will cost you the equivalent of a
quarter-day of total possible work. The more you isolate sections of the
cluster from each other, the less you will lose when a failure occurs. If you
can manually segment your jobs to run one per node and still achieve near 100%
(or more?) of the possible capacity of a more parallelized system, then a
single node failure is inconsequential.
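
The effect of isolating sections can be sketched the same way. The snippet
below assumes equal-sized partitions each running its own day-long job, and a
total failure rate that does not change with how the cluster is split; the
function name and the 64-node count are illustrative only.

```python
def lost_cluster_days(failures_per_month: float, job_days: float,
                      partitions: int) -> float:
    """Average whole-cluster days of work lost per month.

    A failure kills only the job on the partition it hits, which holds
    1/partitions of the cluster, and that job is on average half done.
    """
    return failures_per_month * (job_days / 2.0) / partitions


# 1 failure/month, 1-day jobs, progressively finer partitioning.
for parts in (1, 2, 4, 64):
    loss = lost_cluster_days(1.0, 1.0, parts)
    print(f"{parts:3d} partition(s): {loss:.4f} cluster-days lost per month")
```

One partition gives the half day from the earlier example, two partitions give
the quarter day, and one job per node on a 64-node cluster loses well under a
hundredth of a cluster-day per month.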

Considering the number and types of failures discussed here, there is
obviously no guarantee that a certain type of cluster setup will save you from
having massive problems. Being able to plan for downtime and manage the costs
associated with it is also obviously part of the design and operation of the
cluster. It's a seesaw type of balance: if you want more nodes for less money,
be prepared to spend more time fixing them. Of course with any cluster, more
nodes of any type will logically translate into more down/service time, so
the amount of maintenance work will probably scale non-linearly when
comparing fewer HA nodes vs more, cheaper nodes. By the same logic,
buying fewer, bigger nodes would also result in less work. At some point
this becomes too expensive because you're buying big Suns that are
very expensive per GFLOPS (unless, of course, that suits your needs best...).

Another problem with this whole situation that makes it even more complex is
that many cluster installations are subject to strange pricing/operation cost
models. Various parts may actually lie outside your budget responsibility:

One time costs:
- design costs (on paper)
- equipment purchase
- equipment construction/installation
- equipment configuration
- software installation & configuration

Long term/ongoing:
- software maintenance/reconfiguration
- upkeep/repair
- equipment upgrades
- power costs
- cooling costs

There are probably sub categories these could be split into as well.

The issue here is that, say in a university, power and cooling may be paid for
by the university, as well as manual labour for upkeep and repair.  If that is
the case, then getting very power-inefficient but fast CPUs may work well (AMD
Thunderbirds, e.g. :). If you have to pay for your own power and cooling and
manual labour, then you may well just opt for spending more on gear that is
cheaper to run (Athlon XPs), and at that point may as well go for HA gear as
well (depending on the cost model) to save expensive manual labour (at
commercial rates above $50/hr you can quickly rack up a node's cost in a day of
work).
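
A minimal sketch of such a cost model in Python follows. Every amount in it is
a made-up placeholder, not real pricing, and the class and function names are
hypothetical; the only point is that the same cluster looks very different
depending on which line items land in your own budget.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CostItem:
    name: str
    amount: float       # PLACEHOLDER figure, any currency
    ongoing: bool       # False = one-time cost, True = cost per year
    in_my_budget: bool  # False if someone else (e.g. the university) pays


def my_cost(items, years: float) -> float:
    """Total cost over `years`, counting only the items you pay for."""
    total = 0.0
    for item in items:
        if not item.in_my_budget:
            continue
        total += item.amount * years if item.ongoing else item.amount
    return total


# University-style setup: power, cooling and labour are paid by someone else.
university_items = [
    CostItem("equipment purchase", 100_000.0, ongoing=False, in_my_budget=True),
    CostItem("power and cooling", 20_000.0, ongoing=True, in_my_budget=False),
    CostItem("upkeep/repair labour", 10_000.0, ongoing=True, in_my_budget=False),
]
# Commercial-style setup: same items, but everything is in your budget.
commercial_items = [replace(item, in_my_budget=True) for item in university_items]

university = my_cost(university_items, years=3)
commercial = my_cost(commercial_items, years=3)
print(f"university-style budget over 3 years: {university:,.0f}")
print(f"commercial-style budget over 3 years: {commercial:,.0f}")
```

With these placeholder numbers the three-year cost seen by the cluster owner is
100,000 in the first case and 190,000 in the second, which is why the choice
between power-hungry cheap parts and HA gear depends on who pays for what.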

We have successfully employed the non-HA equipment design in building one of
our clusters, and in fact there are added advantages. We have observed that
most (for various values of 'most': 50% to 80%?) failures occur within the
first month of usage. Once you start swapping out bad nodes, you have a
falling rate of failure (though the age of components slowly catches up over a
long time period, especially for things with moving parts, such as fans). With
all problems taken together (swapping over NFS included, as these are diskless
nodes) we have about 1 node crash or fail in some way every 2 months. Of course,
since jobs can be checkpointed, and a single node failing doesn't take down the
whole cluster (as jobs are run on subsets of nodes), not much work is lost
overall. Given the increased throughput from more nodes for the money, and
including about 15 minutes of work per month physically messing with the
machines that's directly related to hardware problems and crashes (i.e. unrelated
to the time spent maintaining the cluster as per normal operations), it's been
an overall win on that particular cluster. (We have not had to RMA any
equipment since the start of the 2nd month of operation. Under our current
service agreement, an RMA would take 1-3 days and about 20-30 min of labour, and
in the meantime would not significantly impact the cluster's performance.)

As always, designing your cluster around your own needs and limitations is
the biggest win on price/performance. The limitations to this are having
very wide ranges of needs, not having any idea of what capabilities will be
required in the future, expensive losses when there's downtime, and
expensive manual labour to get things working again. Barring these kinds of
considerations, commodity equipment with a failure rate that you can deal with
can net noticeable gains, and having a planned failure cost related to that rate
will save you from surprises.

No matter what kind of cluster you build you WILL have failures, and designing
to be able to mitigate the impact from such to the highest possible extent is
obviously good planning.

/kc

> On Wed, 2002-04-03 at 18:04, Cris Rhea wrote:
> > 
> > What are folks doing about keeping hardware running on large clusters?
> > 
> > Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> > 
> > Sure seems like every week or two, I notice dead fans (each RS-1200
> > has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
> > 
> > My last fan failure was a CPU fan that toasted the CPU and motherboard.
> > 
> > How are folks with significantly more nodes than mine dealing with constant
> > maintenance on their nodes?  Do you have whole spare nodes sitting around-
> > ready to be installed if something fails, or do you have a pile of
> > spare parts?  Did you get the vendor (if you purchased prebuilt systems)
> > to supply a stockpile of warranty parts?
> > 
> > One of the problems I'm facing is that every time something croaks, 
> > Racksaver is very good about replacing it under warranty, but getting
> > the new parts delivered usually takes several days.
> > 
> > For some things like fans, they sent extras for me to keep on-hand.
> > 
> > For my last fan/CPU/motherboard failure, the node pair will be 
> > down ~5 days waiting for parts.
> > 
> > Comments? Thoughts? Ideas?
> > 
> > Thanks-
> > 
> > --- Cris
> > 
> > 
> > 
> > ----
> >   Cristopher J. Rhea                      Mayo Foundation
> >   Research Computing Facility              Pavilion 2-25
> >   crhea at Mayo.EDU                        Rochester, MN 55905
> >   Fax: (507) 266-4486                     (507) 284-0587
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
> -- 
> Leandro Tavares Carneiro
> Support Analyst
> EP-CORP/TIDT/INFI
> Telephone: 2534-1427

-- 
Ken Chase, math at velocet.ca  *  Velocet Communications Inc.  *  Toronto, CANADA