[Beowulf] Re: HPC fault tolerance using virtualization

Mon Jun 29 05:33:37 PDT 2009

Greg Lindahl <lindahl at pbm.com> writes:

>> What I typically see from smartd is alerts when one or more sectors has
>> already gone bad, although that tends not to be something that will
>> clobber the running job.  How should it be configured to do better
>> (without noise)?
>
> That isn't noise, that's signal.

Of course I didn't mean that bad block alerts were noise.  However,
the default smartd configuration does produce what a hardware expert
and I both consider noise.  I'm interested in how best to configure it
for useful warnings.  I did have a look OTW, of course.

> You're just lucky that your running
> job doesn't need the data off the bad sector.

Not if the problem is, say, on /usr, which the job normally isn't going
to need before it finishes.
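
As a rough illustration of the point about /usr, the check "is the bad
sector on a filesystem the job actually uses?" can be approximated by
comparing device IDs.  This is only a sketch: the function names are
made up for the example, and it assumes one filesystem per disk.  With
LVM or RAID a single failing disk can sit under several filesystems,
and st_dev will not show that.

```python
import os


def same_filesystem(path_a, path_b):
    """Return True if both paths are on the same mounted filesystem.

    Compares st_dev, which identifies the filesystem, not the
    physical disk underneath it.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def job_touches_suspect_fs(job_paths, suspect_mount):
    """Return the job's I/O paths that live on the suspect filesystem.

    job_paths: files or directories the job reads or writes
               (assumed known, e.g. its scratch and output dirs).
    suspect_mount: a path on the filesystem where the bad sector
                   was reported, e.g. "/usr".
    """
    return [p for p in job_paths if same_filesystem(p, suspect_mount)]
```

An empty result from job_touches_suspect_fs(paths, "/usr") would
support leaving the job to finish before pulling the node.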

> You can try waiting
> until the job finishes before taking the node out of service; from the
> sounds of it, you will usually win. But if you don't have
> application-level end-to-end checksums of your data, how do you know
> if you won or not?

I know where the job is doing I/O, and I'm not going to kill multi-day,
multi-node jobs -- especially not automatically -- because there's a bad
sector somewhere irrelevant.  Also, here at least, we have more pressing
things to worry about than application checksums, much as they might
feature in an ideal world.
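
For reference, the kind of application-level end-to-end checksum
being discussed might look like the sketch below.  It is not taken
from any particular application; the sidecar ".sha256" file and the
function names are assumptions for the example.  The digest is
computed from the data in memory before it is written, so a later
read-back that doesn't match points at the storage path.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_with_checksum(path, data):
    """Write data to path and record its digest in path + '.sha256'.

    The digest comes from the in-memory bytes, not from the file,
    so it is independent of what the disk actually stored.
    """
    digest = hashlib.sha256(data).hexdigest()
    with open(path, "wb") as f:
        f.write(data)
    with open(path + ".sha256", "w") as f:
        f.write(digest + "\n")
    return digest


def verify(path):
    """Return True if the file still matches its recorded digest."""
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return file_sha256(path) == expected
```

Note that a verify() run straight after the write may be served from
the page cache rather than the disk, so it says more when done later
or from another node.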