Disk reliability (Was: Node cloning)

kragen at pobox.com
Fri Jun 22 22:51:13 PDT 2001

Mark Hahn <hahn at coffee.psychology.mcmaster.ca> writes:
> > What size of CRCs are being used?  If it's a 32-bit CRC and the errors
> > involved are likely to involve several bits, I think your chances of
> > having an uncaught data error are only one in four billion.  Four
> > billion microseconds is about seventy minutes, four billion milliseconds
> > is about a month and a half, and four billion seconds is about 125
> > years.
> hmm, I'll admit I never actually looked at the details.
> the CRC is 16b (not really surprising, since ATA is that wide):
> G(X) = X^16 + X^12 + X^5 + 1.
> so I think your point was to be less blase' about badCRC reports,
> and you're certainly right.  hmm, so the chance of undetected errors
> depends on transfers/second, right?  so figuring a worst-case ATA100
> and nothing but 4K transfers, we'd see something like 20K t/s.
> hmm, how do you go from those numbers to mean time to undetected failure?

Well, you need to know what the mean time to detected failure is.
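
Here is a rough sketch of that conversion in Python.  It assumes every
corrupted transfer is a many-bit error that slips past the CRC with
probability 2^-16; the function name and the one-badCRC-per-day figure
are made up for illustration, not measurements.

```python
def mean_time_to_undetected(mean_time_to_detected_s, crc_bits=16):
    """Estimate seconds between undetected errors from seconds between detected ones.

    Assumption: each corrupted transfer escapes the CRC with probability
    2**-crc_bits, independently of the others.
    """
    p_miss = 2.0 ** -crc_bits
    # On average there are (1 - p_miss) / p_miss detected errors per missed one.
    return mean_time_to_detected_s * (1.0 - p_miss) / p_miss


# Worst case from the quoted message: 20K transfers/s, every one corrupted,
# so a detected error every 1/20000 s.
worst_case_s = mean_time_to_undetected(1 / 20000)
print(worst_case_s)

# Hypothetical: one badCRC report per day.
one_per_day_years = mean_time_to_undetected(86400) / (86400 * 365.25)
print(one_per_day_years)
```

The first number comes out at a little over three seconds, and the
second at roughly 180 years.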

> I think your back-of-envelope numbers were assuming 1 transfer per us,
> right?  so with 16b CRC, you'd expect an uncaught error in 64K/20K=3 s.
> but is that assuming some particular distribution of errors?

The three-second figure would be roughly correct if every transfer had
a many-bit error.  In general, you'd expect one many-bit error out of
every 64K to go undetected.  (The CRC will detect all one-bit errors;
any CRC whose generator polynomial has more than one term does.)
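
If you want to check the one-bit case by brute force, something like
the following would do.  It is a generic bitwise CRC-16 using the
polynomial x^16 + x^12 + x^5 + 1 (0x1021) with a zero initial value,
which I'm assuming matches the polynomial quoted above; it is not the
exact procedure an ATA controller uses, and the function names and
test data are my own.

```python
def crc16(data, poly=0x1021, init=0x0000):
    """Bitwise CRC-16, most significant bit first."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def undetected_single_bit_flips(data):
    """Count the single-bit flips in data that leave the CRC unchanged."""
    good = crc16(data)
    missed = 0
    for i in range(len(data) * 8):
        corrupted = bytearray(data)
        corrupted[i // 8] ^= 0x80 >> (i % 8)  # flip bit i
        if crc16(corrupted) == good:
            missed += 1
    return missed


block = bytes(range(64))  # 64 bytes of arbitrary test data
print(undetected_single_bit_flips(block))
```

No flip should get through, whatever data you feed it.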

If all your observed failures are one-bit errors, you could square
their frequency (the fraction of transfers affected) to get an
estimate of the frequency of two-bit errors, assuming the bit errors
are independent.  I think "two" is a big enough number to be "many",
but I'm not sure.
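
As a worked example of the squaring estimate, here is a small Python
sketch.  The one-in-a-million one-bit error rate is purely
hypothetical, the 20K transfers/s is the worst-case figure quoted
above, and the whole thing assumes bit errors are independent.

```python
def two_bit_error_estimate(one_bit_rate):
    """Estimate the per-transfer two-bit error probability.

    Assumption: bit errors are independent, so the chance of two in one
    transfer is roughly the square of the chance of one.
    """
    return one_bit_rate ** 2


TRANSFERS_PER_SECOND = 20_000  # worst-case figure from the quoted message
one_bit_rate = 1e-6            # hypothetical: one badCRC per million transfers

two_bit_rate = two_bit_error_estimate(one_bit_rate)
seconds_between = 1 / (two_bit_rate * TRANSFERS_PER_SECOND)
print(two_bit_rate)             # per-transfer probability
print(seconds_between / 86400)  # days between two-bit errors
```

With those made-up numbers you'd see a two-bit error every year and a
half or so, and only a fraction of those, if any, would get past the
CRC.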
