[Beowulf] Re: GPU boards and cluster servers.

Tue Sep 9 15:41:01 PDT 2008

>> I _do_ wish it was a bit more common to have onsite spares.  not sure
>> why vendors (HP at least) don't like to do this.  maybe just that it
>> might get kicked around or otherwise abused...
>
> You don't have your own spares kit? For big clusters like yours, it
> doesn't cost much.

could be we don't know how to ask; I'm not aware of HP actually
offering such a kit, nor do I know how much we'd be willing to pay.

it is an interesting question: not just how much does downtime cost you,
but what are the kinds of failures you see and expect?  our clusters
have been remarkably robust, in spite of having pretty mundane hardware.
plain old sata disks, for instance.  we have several instances (sites)
with a ~400-disk filesystem, but I think we're at around a 1-2% annual
failure rate.  we use raid6, but spares for those disks are the most
obvious thing I'd want.  the failure rates for PSUs, motherboards, dimms,
etc are quite a lot lower (maybe 2 PSUs out of 768 nodes per year).
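
to put rough numbers on that, here is a back-of-envelope sketch in python.
the disk count, failure rates and node count are the ones quoted above;
everything else is my assumption: failures are treated as independent
(Poisson), the spares shelf is only restocked once a year, and the 95%
coverage target and the function names are made up for illustration.

```python
import math

def expected_failures(units, annual_failure_rate, years=1.0):
    """Mean number of failures, assuming a constant annual failure rate."""
    return units * annual_failure_rate * years

def spares_needed(mean_failures, confidence=0.95):
    """Smallest k such that a Poisson(mean_failures) count is <= k
    with probability >= confidence (assumes independent failures)."""
    term = math.exp(-mean_failures)  # P(X = 0)
    cdf = term
    k = 0
    while cdf < confidence:
        k += 1
        term *= mean_failures / k    # P(X = k) from P(X = k - 1)
        cdf += term
    return k

disks = 400                               # ~400-disk filesystem per site
low = expected_failures(disks, 0.01)      # 1% annual failure rate
high = expected_failures(disks, 0.02)     # 2% annual failure rate
psu_rate = 2 / 768                        # ~2 PSU failures across 768 nodes

print(f"disk failures per site per year: {low:.0f} to {high:.0f}")
print(f"disk spares for 95% coverage: {spares_needed(low)} to {spares_needed(high)}")
print(f"PSU failure rate per node per year: {psu_rate:.2%}")
```

so the mean is only 4 to 8 dead disks per site per year, but a shelf sized
to the mean would run dry fairly often; the Poisson tail is why the
coverage figure comes out higher.  with more frequent restocking the shelf
could be correspondingly smaller.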

OTOH, most of this hardware is approaching its third birthday.  magic 
warranty-related number there :|