[Beowulf] distributing storage amongst compute nodes

Mon Oct 22 12:11:48 PDT 2007

On Mon, 22 Oct 2007, Leif Nixon wrote:

>> Even though the data shows otherwise?
>
> OK, let me rephrase: That's just true in general, and it's the
> specific cases that bite you.
>
> On *average* disks might not be a large source of problems, but when
> you run into a bad batch it hurts. But that's true for all components,
> of course.
>
> My take on this might be a bit coloured from spending some time the
> other day in the company of a representative sample of disks from a
> certain bad batch, a crowbar, a sledgehammer and several pounds of
> thermite. (Sorry, rgb, no sucker rod available around here.)

Crowbars or hammers are acceptable substitutes, especially when applied
liberally to offending parties, be they human or electronic.

Again, I agree precisely with this.  Over way too many years now to
comfortably think back on, I've had hmmm, "a lot" of systems packed with
"a lotter" of disks.  I've seen stretches where only SCSI disks were
(more) reliable and we'd lose 50 to 60% of our IDE disks over a 2-3 year
period, not through a batch phenomenon (completely different
manufacturers) but because they just plain sucked.  It was so pervasive
that I went looking for bugs in the Linux kernel that might somehow be
breaking the drives.  Then I've gone years without a failure.
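
To put a rough number on "50 to 60% over a 2-3 year period", here is a
minimal sketch that converts a cumulative loss into a per-year failure
rate.  It assumes a constant failure probability per year, which real
disks do not strictly obey, and the function name is made up for
illustration only.

```python
def annualized_failure_rate(cumulative_loss, years):
    """Convert a cumulative loss fraction over `years` into a per-year rate.

    Assumes a constant per-year failure probability p, so that the
    surviving fraction after `years` is (1 - p) ** years.
    """
    if not 0.0 <= cumulative_loss < 1.0:
        raise ValueError("cumulative_loss must be in [0, 1)")
    if years <= 0:
        raise ValueError("years must be positive")
    survival = 1.0 - cumulative_loss
    return 1.0 - survival ** (1.0 / years)


if __name__ == "__main__":
    # Mildest and harshest readings of "50 to 60% over 2-3 years".
    for loss, years in [(0.5, 3), (0.6, 2)]:
        rate = annualized_failure_rate(loss, years)
        print(f"{loss:.0%} lost over {years} years -> about {rate:.0%} per year")
```

Under that assumption the range works out to somewhere between roughly
one disk in five and one disk in three dying every year.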

There was a real downturn a few years back when the manufacturers tried
to drop warranties on non-SCSI drives back down to one year (for a long
time it had been 3 years to SCSI's 5), but fortunately consumers largely
rejected that one.

It may well be true that on average, modern disks are back up in the 3-5
year range even when they're not SCSI.  Still, I don't have to think
back over a very long time to remember disk failures.  In fact, I've got
a downed disk in a RAID at home right now.  RAID tends to hammer disks
and wear them out faster compared to a disk on a standalone node or
desktop that mounts its primary workspace from NFS and runs most
disk-based stuff out of memory (cache or buffer) and is used by just one
person.

So for certain cluster designs, local disks are hardly ever used except
to boot the box up and load its shared libraries into memory.  Shared
disk is all RAID.  In that context, yeah, disks can last a long time and
not cause much downtime.

On a desktop or laptop or on the RAID server, things are different.
Disks get hammered on a server, and are used a lot more on interactive
systems with no remote disk resource.  Those interactive systems usually
have cheaper disks as well, and they go down fairly regularly (crashing,
or being taken down for maintenance before a crash) due to failed disks.

And then, as Leif notes, there are bad batches.  "Fell off the truck"
has a whole new meaning when it describes a batch of disks...

    rgb

-- 
Robert G. Brown
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone(cell): 1-919-280-8443
Web: http://www.phy.duke.edu/~rgb
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977