[Beowulf] Re: HPC fault tolerance using virtualization

Sun Jun 28 17:21:26 PDT 2009

On Sun, Jun 28, 2009 at 12:17:50PM +0100, Dave Love wrote:

> > and for disks run a smartctl test and see if a disk is showing
> > symptoms which might make it fail in future.
> 
> What I typically see from smartd is alerts when one or more sectors have
> already gone bad, although that tends not to be something that will
> clobber the running job.  How should it be configured to do better
> (without noise)?

That isn't noise, that's signal. You're just lucky that your running
job doesn't need the data off the bad sector. You can try waiting
until the job finishes before taking the node out of service; from the
sounds of it, you will usually win. But if you don't have
application-level end-to-end checksums of your data, how do you know
if you won or not?
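
For anyone wondering what I mean by application-level end-to-end
checksums, here is a minimal sketch in Python. The sidecar ".sha256"
file and the function names are made up for illustration, not taken
from any particular framework. The point is only that the digest is
computed from the data the application produced, before it touches the
disk, and is checked again on the bytes that actually come back.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Hash the bytes actually read back from disk, one chunk at a time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def write_with_checksum(path, data):
    """Write data, and record a digest of the in-memory bytes next to it.

    The ".sha256" sidecar file is a hypothetical convention for this sketch.
    """
    with open(path, "wb") as f:
        f.write(data)
    # Digest comes from what the application produced, not from a re-read.
    digest = hashlib.sha256(data).hexdigest()
    with open(path + ".sha256", "w") as f:
        f.write(digest + "\n")
    return digest


def verify(path):
    """Return True if the file on disk still matches its recorded digest."""
    with open(path + ".sha256") as f:
        expected = f.read().strip()
    return file_sha256(path) == expected
```

A job would call write_with_checksum() for each output file and
verify() before trusting any input. A False result, or an I/O error
during the read, tells you that you lost, whatever smartd did or did
not report.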

In my big MapReduce cluster (800 data disks), when a disk starts to fail,
about 2/3 of the time I'll see an I/O error or checksum failure in my
application, and 1/3 of the time I'll see a smartd error and no
application error.

-- greg