[Beowulf] Re: GPU boards and cluster servers.

Mike Davis jmdavis1 at vcu.edu
Tue Sep 9 16:10:33 PDT 2008

> could be we don't know how to ask; I'm not aware of HP actually 
> offering such a kit.  or how much we'd be willing to pay.
> it is an interesting question: not just how much does downtime cost you,
> but what are the kinds of failures you see and expect?  our clusters
> have been remarkably robust, in spite of having pretty mundane hardware.
> plain old sata disks, for instance.  we have several instances (sites)
> with a ~400 disk filesystem, but I think we're around 1-2% annual failure
> rate.  we use raid6, but spares for those disks are the most obvious 
> thing I'd want.  the failure rate for PSU's, motherboards, dimms, etc
> are quite a lot lower (maybe 2 psu's of 768 nodes per year.)
> OTOH, most of this hardware is approaching its third birthday.  magic 
> warranty-related number there :|

The last I checked neither HP nor Sun offered a spares kit. Apple does. 
I have built my own for the Suns (at least those that are not under NBD 
support).

Disks are the obvious thing. We have had several (3) power supplies fail 
and in one case we had both bad RAM and a bad daughter board in an x4600. 
This is with machines that range from the v60 (PIV 3.0 GHz) to the 
v20z (2.4 GHz Opteron), to the x4100 (2.6 GHz Opteron) to the x2200 
(2.6 GHz Opteron) to the x4600 (2.8 GHz Opteron).

We have also had a couple of issues with MBs (actually usually a 
built-in controller on the MB, often SCSI).

Our real problem now is heat-related failures on our 2004 v20z machines. 
These machines run hot; our raised-floor machine room can't keep the 
upper nodes in the rack cool, and we have had several failures that 
appear CPU/MB-related in the past year after several heat events. I am 
hoping to get these machines replaced as soon as possible since they use 
as much power and require as much cooling as newer machines with dual 
dual-core or dual quad-core processors.

NBD support can make sense for certain systems (particularly systems 
that are managed for another department). I like to have it and some 
spares for my machines.
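
For anyone sizing a home-built spares kit from the figures quoted above 
(~400 disks at 1-2% annual failure, roughly 2 PSUs per 768 nodes per 
year), here is a rough sketch. It assumes failures are independent and 
Poisson-distributed, which heat events like ours clearly violate, and 
the function names, the one-year restock window and the 95% confidence 
level are illustrative choices, not anything from a vendor.

```python
import math


def expected_failures(units, annual_failure_rate, years=1.0):
    """Mean number of failures over the window (simple rate * count)."""
    return units * annual_failure_rate * years


def spares_needed(units, annual_failure_rate, years=1.0, confidence=0.95):
    """Smallest k such that P(failures <= k) >= confidence.

    Assumes failures follow a Poisson distribution with the mean from
    expected_failures(); this is a modelling assumption, not a measurement.
    """
    mean = expected_failures(units, annual_failure_rate, years)
    term = math.exp(-mean)      # P(X = 0)
    cumulative = term
    k = 0
    while cumulative < confidence:
        k += 1
        term *= mean / k        # P(X = k) from P(X = k - 1)
        cumulative += term
    return k


if __name__ == "__main__":
    for rate in (0.01, 0.02):
        print("disks @ %.0f%%:" % (rate * 100), spares_needed(400, rate))
    print("PSUs:", spares_needed(768, 2.0 / 768))
```

Under those assumptions the 400-disk filesystem would want about 8 
spare disks on the shelf at 1% and about 13 at 2% to cover a year 
without restocking; a shorter restock window (pass years=0.25, say) 
brings the numbers down a lot.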


