[Beowulf] Re: failure trends in a large disk drive population
Jim Lux James.P.Lux at jpl.nasa.gov
Fri Feb 16 14:15:49 PST 2007
At 12:50 PM 2/16/2007, David Mathog wrote:
>Eugen Leitl <eugen at leitl.org> wrote:
>
> > http://labs.google.com/papers/disk_failures.pdf
>
>Interesting. However google apparently uses:
>
>  serial and parallel ATA consumer-grade hard disk drives,
>  ranging in speed from 5400 to 7200 rpm
>
>Not quite clear what they meant by "consumer-grade", but I'm assuming
>that it's the cheapest disk in that manufacturer's line. I don't
>typically buy those kinds of disks, as they have only a 1 year
>warranty but rather purchase those with 5 year warranties.

But this is potentially a very interesting trade-off, and one right in line with the Beowulf concept of leveraging cheap consumer gear...

Say you need 100 widgets worth of horsepower. Are you better off buying 103 pro widgets at $500 and a 3% failure rate or 110 consumer widgets at $450 and a 10% failure rate.... $51.5K vs $49.5K... the cheap drives win..

And, in fact, if the drives fail randomly during the year (not a valid assumption in general, but easy to calculate on the back of an envelope), then you actually get more compute power with the cheap drives (105 average vs 101.5 average over the year)

This also assumes that the failure rate is "small" and "independent" (that is, you don't wind up with a bad batch that all fail simultaneously from some systemic flaw.. the bane of a reliability calculation)

One failing I see of many cluster applications is that they are quite brittle.. that is, they depend on a particular number of processors toiling on the task, and the complement of processors not changing during the "run". But this sort of thing makes a 100 node cluster no different than depending on the one 100xspeed supercomputer.

I think it's pretty obvious that Google has figured out how to partition their workload in a "can use any number of processors" sort of way, in which case, they probably should be buying the cheap drives and just letting them fail (and stay failed.. it's probably cheaper to replace the whole node than to try and service one)...

James Lux, P.E.
Spacecraft Radio Frequency Subsystems Group
Flight Communications Systems Section
Jet Propulsion Laboratory, Mail Stop 161-213
4800 Oak Grove Drive
Pasadena CA 91109
tel: (818)354-2075
fax: (818)393-6875
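
The back-of-envelope comparison in the message above can be reproduced with a few lines of Python. This is only a sketch under the post's own simplifying assumptions: failures are independent and spread uniformly over the year, so the average working count is the midpoint between the number bought and the number still alive at year end, and failed units are not replaced. The function name `fleet_cost_and_average` and its parameters are made up for illustration. The exact midpoints come out near 101.5 and 104.5, which the post rounds to 101.5 and 105.

```python
def fleet_cost_and_average(units_bought, unit_price, annual_failure_rate):
    """Return (purchase cost, average working units over one year).

    Assumes failures are independent and spread uniformly over the year,
    and that failed units are never replaced (the post's simplification).
    """
    cost = units_bought * unit_price
    failed_by_year_end = units_bought * annual_failure_rate
    survivors = units_bought - failed_by_year_end
    # With a uniform failure rate the working count declines linearly,
    # so the yearly average is the midpoint of start and end counts.
    average_working = (units_bought + survivors) / 2
    return cost, average_working


# Figures taken from the post: 103 "pro" widgets vs 110 "consumer" widgets.
pro = fleet_cost_and_average(103, 500, 0.03)
consumer = fleet_cost_and_average(110, 450, 0.10)

print(f"pro:      ${pro[0]:,} for {pro[1]:.1f} average working units")
print(f"consumer: ${consumer[0]:,} for {consumer[1]:.1f} average working units")
```

Changing the failure rates or prices shows how quickly the conclusion flips; the sketch says nothing about correlated failures such as a bad batch, which the post flags as the weak point of this kind of estimate.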
