[Beowulf] Consumer vs. Enterprise Hard Drives in Clusters


Eugen Leitl eugen at leitl.org
Sat Jan 24 01:49:46 PST 2009


On Fri, Jan 23, 2009 at 02:13:02PM -0800, Bill Broadley wrote:

> I've seen little correlation between weight and vibration.  After all, even
> the built-like-a-tank hardware is still noisy.

If yelling at a RAID array in a noisy data center causes a latency peak, the
drives themselves are obviously susceptible. The cover plate is thin, after all.

Another reason to look forward to SSDs.
 
> Just a delay between read/write and the answer.  Usually there is a timeout,
> after all a completely dead drive might never answer.

Does anyone know whether WDTLER.EXE
(http://en.wikipedia.org/wiki/Time-Limited_Error_Recovery) still works on
modern non-RAID-Edition WD Green lines? The price difference is some 50 EUR
for TByte drives.
 
> Well, you don't want the drive hiding the fact that you had to retry 10 times
> to read a sector.  Sure, smartctl can track this kind of thing; strangely,

I should make it a habit to read the SMART trend reports for my drive population.
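
A minimal sketch of what such a trend check could look like, assuming the
attribute table printed by `smartctl -A` has been saved as text on two
different days. The function names and the choice of watched attributes are
illustrative assumptions; attribute names vary by vendor.

```python
import re

# Attributes often watched for early signs of media trouble.
# This selection is an assumption for illustration, not a vendor recommendation.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            match = re.match(r"\d+", fields[9])
            if match:
                attrs[fields[1]] = int(match.group())
    return attrs


def trend(previous, current):
    """Return watched attributes whose raw value rose between two snapshots.

    The result maps attribute name to an (old_raw, new_raw) tuple.
    """
    return {
        name: (previous.get(name, 0), value)
        for name, value in current.items()
        if name in WATCHED and value > previous.get(name, 0)
    }
```

Feeding it two saved snapshots per drive and alerting on a non-empty result
would surface a drive whose pending or reallocated sector counts are creeping
up, which is the kind of silent retry growth discussed in this thread.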

> hardware RAID controllers often hide that info from the operating system.
> Basically, for a RAID you want a "yes, you have this block" or "no, you don't
> have a block" within a fairly short time window.  Especially in the gruesome
> case of a manual rebuild, where you don't want the marginal sectors sending
> your drive into la la land, preventing you from getting the perfectly healthy
> blocks off.
> 
> It all comes down to this: it's easier to deal with a "sorry, can't get that
> block" within 50ms than to handle a drive that disappears for tens of seconds
> at a time.
> 
> The kind of nightmare scenario I've seen is a 16-disk array where bit rot
> starts: the array looks perfect, but of course the number of invisible retries
> starts increasing.  If you are using a pathetically old kernel (like, say, the
> standard RHEL kernel) you don't have ECC scrubbing.  Then of course a drive
> drops, you

Apropos scrubbing, is chipkill worth it? Some AMD systems I've seen have ECC buffered
DIMMs with chipkill. 

> go to rebuild, then a 2nd drive hits an error (that has been silent till now).
> Then you are in a position where you want to scan all drives and hope that
> the errors that you find are not aligned with the errors on other drives.
> With RAID edition drives you can do such a rebuild in a reasonable amount of
> time; with desktop drives, even one with 99% good blocks can lead to very
> high rebuild times.
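
To put rough numbers on that rebuild-time concern, here is a
back-of-the-envelope sketch. It assumes every marginal sector stalls the drive
for a fixed time before the read either succeeds or is given up, and that
stalls do not overlap. The function name and all parameters are illustrative
assumptions, not measured values.

```python
def rebuild_hours(capacity_bytes, throughput_bytes_per_s, bad_fraction,
                  stall_s, sector_bytes=512):
    """Estimate hours to read a whole drive when some sectors stall.

    bad_fraction is the share of sectors that trigger a stall, and stall_s is
    how long the drive spends on each one (a short, bounded error-recovery
    limit versus a long internal retry loop).
    """
    sectors = capacity_bytes // sector_bytes
    bad_sectors = round(sectors * bad_fraction)
    streaming_s = capacity_bytes / throughput_bytes_per_s
    stalled_s = bad_sectors * stall_s
    return (streaming_s + stalled_s) / 3600.0


# Same hypothetical drive and defect rate, two different per-sector stall times.
bounded = rebuild_hours(10**12, 75e6, 1e-6, stall_s=0.05)
unbounded = rebuild_hours(10**12, 75e6, 1e-6, stall_s=10.0)
```

The streaming term is the same in both calls; only the stall term changes, and
it scales linearly with the per-sector stall time. That is why a drive that
gives up quickly can be scanned in a predictable time while one that retries
for seconds per sector cannot.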

I'm aware of the problem, and am looking at FreeNAS 0.7 (currently pre-alpha)
with scrubbing and ZFS/RAID-Z for self-healing.
 
> I'm guessing that when a 120MB/sec consumer drive is providing 20-30MB/sec
> its service life is shortened, but I've no numbers to back that up.  In
> the same conditions a RAID edition drive provided 75MB/sec or so with or
> without vibration.

As another anecdote, I had drives from the 7200.11 TByte line perform awfully
on DB-like tasks, with a lot of issues reported by SMART and failures during
use (one RAID 1 failed to rebuild because the second drive died during
reconstruction).
 
> Manufacturers are starting to mention the number of drives in a RAID... they
> seem to be differentiating between single drive, 2-4 drive arrays, and larger.

...
 


