hard disk reliability

Douglas Eadline deadline@plogic.com
Thu, 3 Jun 1999 11:04:00 -0400


On Thu, 3 Jun 1999, Rob Ross wrote:
I would agree. Here is the order of failures/problems after a system 
is burned in (this is what we have seen):

1. power supplies (general failures)
2. hard drives (due to shipping)
3. hard drives (general failures)
4. motherboards failing
5. NICs going haywire
6. Cable problems
7. Switch problems

When building systems we have seen (in order of occurrence):

1. bad SDRAM (far more than we care to think about)
2. bad IDE drives
3. bad Motherboards
4. bad SCSI cables, floppies, NICs

BTW: we have found that early PII-400s had problems with 
Linux SMP. After eliminating everything else, we found that 
replacing the CPUs (with PIII-450s) solved the problem. The symptoms
included random crashes, wrong answers, and stalled MPI/PVM runs. They only
appeared when the system was under high load with lots of interrupts, and
went away if the FSB was set to 66MHz.


Doug

> Actually, I have found that power supplies have been the least reliable
> components of our systems.
> 
> Rob Ross
> Parallel Architecture Research Lab, Clemson University
> 
> On Thu, 3 Jun 1999, Christoph Wasshuber wrote:
> 
> > Some days ago someone mentioned that one of
> > the big benefits of running a diskless cluster
> > is the increased reliability. Hard disks are
> > the most unreliable part in PCs. Does anybody
> > have manufacturer numbers like MTBF (mean time
> > between failures)?
> > 
> > I would also be interested in comments from
> > people running beowulfs with 100 or more
> > nodes, where every node has a hard disk. Do
> > you guys exchange a hard disk every month?
> > Or even every week?
> > 
> > How serious is the hard disk reliability issue
> > in reality?
> > 
> > Chris....
> 
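
For the MTBF question quoted above, a rough back-of-the-envelope sketch of
how a manufacturer MTBF figure would translate into replacements per month
for a cluster. This assumes a constant failure rate (exponential model) and
drives running 24x7. The MTBF values in the loop are hypothetical
placeholders, not vendor numbers or anything we have measured, and the
function names are made up for illustration.

```python
import math

# Approximate hours in a month: 8760 hours per year / 12.
HOURS_PER_MONTH = 730.0


def expected_failures_per_month(num_disks, mtbf_hours):
    """Expected number of disk failures per month.

    Assumes a constant failure rate, so each disk fails at a rate of
    1 / mtbf_hours and the rates of all disks simply add up.
    """
    return num_disks * HOURS_PER_MONTH / mtbf_hours


def prob_at_least_one_failure(num_disks, mtbf_hours, hours=HOURS_PER_MONTH):
    """Probability of one or more failures in the interval (exponential model)."""
    return 1.0 - math.exp(-num_disks * hours / mtbf_hours)


if __name__ == "__main__":
    nodes = 100  # one disk per node, as in the question
    # Hypothetical MTBF values in hours, for illustration only.
    for mtbf in (100_000, 300_000, 500_000):
        per_month = expected_failures_per_month(nodes, mtbf)
        p_any = prob_at_least_one_failure(nodes, mtbf)
        print(f"MTBF {mtbf:>7} h: {per_month:.2f} failures/month, "
              f"P(at least one in a month) = {p_any:.2f}")
```

Note that this only covers steady-state failures after burn-in. It says
nothing about shipping damage or parts that are bad on arrival, which a
constant-rate model does not capture.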

-------------------------------------------------------------------
Paralogic, Inc.           |     PEAK     |      Voice:+610.861.6960
115 Research Drive        |   PARALLEL   |        Fax:+610.861.8247
Bethlehem, PA 18017 USA   |  PERFORMANCE |    http://www.plogic.com
-------------------------------------------------------------------