Disk reliability (Was: Node cloning)

Mark Hahn hahn at coffee.psychology.mcmaster.ca
Sun May 27 09:23:02 PDT 2001


> > > You can try using hdparm to turn the DMA off.  Of course, it does slow
> > > down data transfer rates considerably.
> > 
> > As Mark said, BadCRC only means that the transfer was retried.  If a few
> > BadCRC messages are the only problem, I would not turn off DMA.
> 
> What size of CRCs are being used?  If it's a 32-bit CRC and the errors
> involved are likely to involve several bits, I think your chances of
> having an uncaught data error are only one in four billion.  Four
> billion microseconds is about seventy minutes, four billion milliseconds
> is about a month and a half, and four billion seconds is about 125
> years.

hmm, I'll admit I never actually looked at the details.
the CRC is 16b (not really surprising, since ATA is that wide):
G(X) = X^16 + X^12 + X^5 + 1.
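for anyone who wants to poke at it, here is a minimal sketch of a bitwise CRC-16 over that generator (a 16-bit CRC needs a degree-16 polynomial, x^16 + x^12 + x^5 + 1, usually written 0x1021). the function name, the zero seed and the byte-at-a-time, MSB-first ordering are my assumptions for illustration; the seed and word/bit ordering that real ATA hardware uses are not modeled here.

```python
POLY = 0x1021  # x^16 + x^12 + x^5 + 1, with the x^16 term implicit


def crc16(data: bytes, init: int = 0x0000) -> int:
    """Bitwise, MSB-first CRC-16 over `data` (illustrative, not ATA-exact)."""
    crc = init
    for byte in data:
        crc ^= byte << 8              # bring the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:          # top bit set: shift and subtract the generator
                crc = ((crc << 1) ^ POLY) & 0xFFFF
            else:                     # top bit clear: just shift
                crc = (crc << 1) & 0xFFFF
    return crc


if __name__ == "__main__":
    block = bytes(range(256)) * 16            # a made-up 4K transfer
    good = crc16(block)

    damaged = bytearray(block)
    damaged[100] ^= 0x04                      # flip a single bit
    print(hex(good), hex(crc16(bytes(damaged))))
```

a single flipped bit always changes the result; what can slip through are multi-bit error patterns that happen to be a multiple of the generator.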

so I think your point was to be less blasé about badCRC reports,
and you're certainly right.  hmm, so the chance of undetected errors
depends on transfers/second, right?  so figuring a worst-case ATA100
and nothing but 4K transfers, we'd see something like 20K t/s.
hmm, how do you go from those numbers to mean time to undetected failure?

I think your back-of-envelope numbers were assuming 1 transfer per µs,
right?  so with 16b CRC, and every transfer corrupted, you'd expect an
uncaught error in 64K/20K = ~3 s.
but is that assuming some particular distribution of errors?
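
a rough way to turn those numbers into a mean time to undetected error, as a sketch only. it assumes each transfer is corrupted independently with probability p_corrupt, and that a corrupted transfer gets past an n-bit CRC with probability 2^-n (i.e. the damage looks random). both are modeling assumptions, not measurements, and the function name and parameters are made up for the example.

```python
def mean_time_to_undetected_error(transfers_per_s: float,
                                  crc_bits: int,
                                  p_corrupt: float = 1.0) -> float:
    """Expected seconds until one corrupted transfer slips past the CRC.

    Assumes independent transfers and random-looking corruption, so a
    corrupted transfer escapes detection with probability 2**-crc_bits.
    """
    if transfers_per_s <= 0 or not 0 < p_corrupt <= 1:
        raise ValueError("need transfers_per_s > 0 and 0 < p_corrupt <= 1")
    p_escape = 2.0 ** -crc_bits
    undetected_per_s = transfers_per_s * p_corrupt * p_escape
    return 1.0 / undetected_per_s


if __name__ == "__main__":
    # worst case from the post: 20K transfers/s, 16-bit CRC, every transfer bad
    print(mean_time_to_undetected_error(20_000, 16), "s")

    # the quoted estimate: one transfer per microsecond, 32-bit CRC
    print(mean_time_to_undetected_error(1e6, 32) / 60, "min")

    # same 16-bit case, but only one transfer in a million is corrupted
    print(mean_time_to_undetected_error(20_000, 16, p_corrupt=1e-6) / 86_400, "days")
```

the answer scales as 1/p_corrupt, so the 3 s figure only holds if every single transfer is bad; the real rate of BadCRC events is the number that matters.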

thanks, mark hahn.





