Mike Sullivan mike.sullivan at
Mon Feb 10 21:00:24 PST 2003


    There is no real right or wrong answer to this one unless you lie at
one extreme or the other in terms of line stability.  If you have very
frequent short term outages then you want to protect the whole cluster. If
you have rare long duration service loss then you want to protect the
filesystems on the master and fileservers, and would provide backup for
just those nodes. If you have frequent long duration outages, buy an abacus.

    Most clusters will lie in between the above extremes and you have to
find a balance that best suits your needs, maximizing either the number of
CPUs or the uninterrupted run time.  Note you could protect part of the
cluster and designate a high availability queue. I would tend to agree with
the other posters about battery reliability in cheaper models, so buy decent
ones. You may wish to test runtimes during maintenance shutdowns to catch
dying batteries.
Things you should consider are the history of line stability, specifically
the frequency and duration of service loss; usage patterns, i.e. whether
there are likely to be a lot of long duration jobs; and the fraction of
users with long duration runs that do not have restart capabilities in
their code.
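
For what it's worth, the rule of thumb for the extremes at the top of this
message can be written down as a tiny Python sketch. The function name, the
two yes/no inputs and the returned strings are all made up for illustration,
and the "rare and short" case is my own assumption since it is not one of the
extremes described above. Where "frequent" and "long" begin is a site-specific
judgment that this sketch does not try to make.

```python
def ups_strategy(outages_frequent: bool, outages_long: bool) -> str:
    """Map the line-stability extremes onto a UPS strategy (illustrative only)."""
    if outages_frequent and not outages_long:
        # Short blips are cheap to ride out, so cover every node.
        return "protect the whole cluster"
    if not outages_frequent and outages_long:
        # No battery will outlast the outage; just keep the filesystems safe.
        return "protect master and fileservers"
    if outages_frequent and outages_long:
        # The tongue-in-cheek case from the rule of thumb.
        return "buy an abacus"
    # Assumption: rare, short outages are not covered by the rule of thumb.
    return "no strong case either way"


if __name__ == "__main__":
    for frequent in (True, False):
        for long_lasting in (True, False):
            print(frequent, long_lasting, "->", ups_strategy(frequent, long_lasting))
```

Most real sites fall between these cases, which is where protecting part of
the cluster behind a high availability queue comes in.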

                                Mike Sullivan

Mike Sullivan                           Director Performance Computing
@lliance Technologies,                  Voice: (416) 385-3255,
18 Wynford Dr, Suite 407                Fax:   (416) 385-1774
Toronto, ON, Canada, M3C-3S2            Toll Free:1-877-216-3199
