[Beowulf] New member, upgrading our existing Beowulf cluster
csamuel at vpac.org
Thu Dec 3 18:32:12 PST 2009
----- "Greg Lindahl" <lindahl at pbm.com> wrote:
> That kind of policy has a fairly high opportunity
> cost, even before you factor in linked nodes.
Well, we cannot dictate to our users what they do;
we set a maximum walltime of 3 months and tell users
that they should checkpoint (if they have control of
the application and the coding skills to do so).
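
For anyone wondering what "checkpoint" means in practice, here is a
rough sketch, not anything we actually hand out to users. The file
name, the state layout and the helper names (save_checkpoint,
load_checkpoint, run) are all made up for illustration. The idea is
that the job periodically writes its state to disk and, when
restarted, resumes from the last saved step instead of from zero.

```python
import json
import os
import tempfile

CHECKPOINT = "state.json"  # hypothetical file name, pick your own


def save_checkpoint(state, path=CHECKPOINT):
    """Write state to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the data has reached the disk
    os.replace(tmp, path)  # rename, so readers never see a half-written file


def load_checkpoint(path=CHECKPOINT):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(n_steps, every=100, path=CHECKPOINT):
    """Run up to n_steps, resuming from the last checkpoint if there is one."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        state["total"] += step  # stand-in for the real computation
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)  # final checkpoint at the end of the run
    return state
```

Writing to a temporary file and renaming it means a node dying in the
middle of a write leaves the previous checkpoint intact. How often to
checkpoint is a trade-off between I/O cost and how much work you can
afford to lose.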
> E.g. you see a system disk going bad, but the user
> will lose all their output unless the job runs for
> 4 more weeks...
We run SMART tests and the like to try to proactively
spot bad disks (and other hardware) before they fail,
but yes, cases like that are inevitable.
Christopher Samuel - (03) 9925 4751 - Systems Manager
The Victorian Partnership for Advanced Computing
P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency