[Beowulf] Consumer vs. Enterprise Hard Drives in Clusters
eugen at leitl.org
Sat Jan 24 01:49:46 PST 2009
On Fri, Jan 23, 2009 at 02:13:02PM -0800, Bill Broadley wrote:
> I've seen little correlation between weight and vibration. After all even the
> built like a tank hardware is still noisy.
If yelling at a RAID array in a noisy data center causes a latency peak, the
drives themselves are obviously susceptible. The cover plate is thin, after all.
Another reason to look forward to SSDs.
> Just a delay between read/write and the answer. Usually there is a timeout,
> after all a completely dead drive might never answer.
Does anyone know whether WDTLER.EXE (http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery)
still works on current non-RAID-Edition WD Green drives? The price difference is
some 50 EUR for TByte drives.
> Well you don't want the drive hiding the fact that you had to retry 10 times
> to read a sector. Sure smartctl can track this kind of thing, strangely
I should make it a habit to read the SMART trend reports for my drive population.
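
A minimal sketch of what such a trend check could look like, assuming two
saved outputs of `smartctl -A` per drive. The watched attribute names and the
sample tables are made up for illustration; real drives report different
attribute sets, and some raw values are not plain integers (those rows are
skipped here).

```python
# Sketch only: compare two saved `smartctl -A` attribute tables and report
# raw counters that grew. The watched names below are an assumption.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")

def parse_attributes(report):
    """Map attribute name -> integer raw value from `smartctl -A` text."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # rows whose raw value is not a plain integer are ignored.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs

def growing_counters(old_report, new_report, watched=WATCHED):
    """Return {name: (old, new)} for watched counters that increased."""
    old, new = parse_attributes(old_report), parse_attributes(new_report)
    return {name: (old[name], new[name])
            for name in watched
            if name in old and name in new and new[name] > old[name]}

# Made-up sample tables in the usual `smartctl -A` column layout.
SAMPLE_OLD = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""
SAMPLE_NEW = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

print(growing_counters(SAMPLE_OLD, SAMPLE_NEW))
```

Run from cron against snapshots taken a week apart, this would flag drives
whose reallocation counts are creeping up before the array notices anything.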
> hardware RAID controllers often hide that info from the operating system.
> Basically for a raid you want a yes you have this block or no you don't have a
> block within a fairly low time window. Especially in the gruesome case of a
> manual rebuild where you don't want the marginal sectors sending your drive
> into la la land preventing you from getting the perfectly healthy blocks off.
> It all comes down to it's easier to deal with a sorry, can't get that block
> within 50ms than handle a drive that disappears for 10's of seconds at a time.
> The kind of nightmare scenarios I've seen is a 16 disk array bit rot starts,
> the array looks perfect, but of course the number of invisible retries starts
> increasing. If you are using a pathetically old kernel (like say the standard
> RHEL kernel) you don't have ECC scrubbing. Then of course a drive drops, you
Apropos scrubbing, is chipkill worth it? Some AMD systems I've seen have ECC buffered
DIMMs with chipkill.
> go to rebuild, then a 2nd drive hits an error (that has been silent till now).
> Then you are in a position where you want to scan all drives and hope that
> the errors that you find are not aligned with the errors on other drives.
> With RAID edition drives you can do such a rebuild in a reasonable amount of
> time, with desktop drives, even one that is 99% good blocks can lead to very
> high rebuild times.
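
A back-of-the-envelope sketch of that rebuild-time gap. Everything here is an
assumption for illustration: a 1 TByte drive read sequentially at 75 MB/sec
(the RAID-edition figure mentioned later in this thread), 1000 marginal
sectors (a made-up count), and a fixed stall per marginal sector of either
50 ms (drive gives up quickly) or 30 s (drive retries for "10's of seconds").

```python
def rebuild_seconds(capacity_bytes, throughput_bytes_per_s, marginal_sectors, stall_s):
    """Sequential read time plus a fixed stall for every marginal sector."""
    return capacity_bytes / throughput_bytes_per_s + marginal_sectors * stall_s

CAPACITY = 1e12      # 1 TByte drive
THROUGHPUT = 75e6    # 75 MB/sec sustained read (assumed constant)
MARGINAL = 1000      # assumed number of marginal sectors (made up)

fast_fail = rebuild_seconds(CAPACITY, THROUGHPUT, MARGINAL, 0.05)  # 50 ms give-up
slow_fail = rebuild_seconds(CAPACITY, THROUGHPUT, MARGINAL, 30.0)  # assumed 30 s stall

print(f"fast fail: {fast_fail / 3600:.1f} h, slow fail: {slow_fail / 3600:.1f} h")
```

Under these assumptions the read pass takes about 3.7 hours with fast failure
and about 12.0 hours with long retries, and the gap grows linearly with the
number of marginal sectors.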
I'm aware of the problem, and looking at FreeNAS 0.7 (currently pre-alpha)
with scrubbing and ZFS/RAID-Z for self-healing.
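
To illustrate the idea behind scrubbing with self-healing, here is a toy
sketch: a two-way mirror held in memory, with one SHA-256 checksum per block.
This is not ZFS's actual on-disk logic; the `scrub` function, the data layout
and the checksum choice are all assumptions made for the example.

```python
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def scrub(copies, checksums):
    """Verify each block on every copy; repair bad ones from a copy that verifies.

    copies: list of lists of bytes, one list per mirror side (modified in place).
    checksums: expected checksum for each block index.
    Returns the (copy_index, block_index) pairs that were repaired.
    """
    repaired = []
    for i, expected in enumerate(checksums):
        good = next((c[i] for c in copies if checksum(c[i]) == expected), None)
        if good is None:
            continue  # no intact copy left: unrecoverable in this toy model
        for n, c in enumerate(copies):
            if checksum(c[i]) != expected:
                c[i] = good
                repaired.append((n, i))
    return repaired

blocks = [b"alpha", b"beta", b"gamma"]
sums = [checksum(b) for b in blocks]
side_a, side_b = list(blocks), list(blocks)
side_b[1] = b"bit rot"   # silent corruption on one mirror side

print(scrub([side_a, side_b], sums))  # prints [(1, 1)]
```

The point of running this periodically is that the silent error gets found
and fixed while a good copy still exists, rather than during a rebuild.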
> I'm guessing that when a 120MB/sec consumer drive is providing 20-30MB/sec
> that its service life is shortened, but I've no numbers to back that up. In
> the same conditions a raid edition drive provided 75MB/sec or so with or
> without vibration.
As another anecdote, I had the 7200.11 TByte line perform awfully on DB-like
tasks, with a lot of issues reported by SMART and failures during use (one RAID 1
failed to rebuild because the second drive died during reconstruction).
> Manufacturers are starting to mention the number of drives in a RAID... they
> seem to be differentiating between single drive, 2-4 drive arrays, and larger.