[Beowulf] Re: GPU boards and cluster servers.
Mike Davis jmdavis1 at vcu.edu
Tue Sep 9 16:10:33 PDT 2008
- Previous message: [Beowulf] Re: GPU boards and cluster servers.
- Next message: [Beowulf] Re: GPU boards and cluster servers.
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> could be we don't know how to ask; I'm not aware of HP actually
> offering such a kit. or how much we'd be willing to pay.
>
> it is an interesting question: not just how much does downtime cost you,
> but what are the kinds of failures you see and expect? our clusters
> have been remarkably robust, in spite of having pretty mundane hardware.
> plain old sata disks, for instance. we have several instances (sites)
> with a ~400 disk filesystem, but I think we're around 1-2% annual failure
> rate. we use raid6, but spares for those disks are the most obvious
> thing I'd want. the failure rate for PSU's, motherboards, dimms, etc
> are quite a lot lower (maybe 2 psu's of 768 nodes per year.)
>
> OTOH, most of this hardware is approaching its third birthday. magic
> warranty-related number there :|

The last I checked, neither HP nor Sun offered a spares kit. Apple does. I have built my own for the Suns (at least those that are not under NBD support). Disks are the obvious thing. We have had several (3) power supplies fail, and in one case we had both bad RAM and a bad daughter board in a 4600. This is with machines that range from the v60 (PIV 3.0 GHz) to the v20z (2.4 GHz Opteron), to the x4100 (2.6 GHz Opteron), to the x2200 (2.6 GHz Opteron), to the x4600 (2.8 GHz Opteron). We have also had a couple of issues with MBs (actually usually a built-in controller on the MB, often SCSI).

Our real problem now is heat-related failures on our 2004 v20z machines. These machines run hot, our raised-floor machine room can't keep the upper nodes in the rack cool, and we have had several failures that appear CPU/MB related in the past year after several heat events. I am hoping to get these machines replaced as soon as possible, since they use as much power and require as much cooling as newer machines with dual dual-core or dual quad-core processors.

NBD support can make sense for certain systems (particularly systems that are managed for another department). I like to have it and some spares for my machines.

Mike
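
The failure figures quoted in the message above (a ~400 disk filesystem at roughly 1-2% annual failure rate, and maybe 2 PSU failures across 768 nodes per year) can be turned into a rough spares count. The following is only a sketch: it assumes failures are independent and Poisson-distributed, which nobody in the thread claims, and the function names and the 95% coverage target are hypothetical choices made for illustration.

```python
import math


def expected_failures(units: int, annual_rate: float) -> float:
    """Expected number of failures per year for a population of parts."""
    return units * annual_rate


def spares_needed(expected: float, coverage: float = 0.95) -> int:
    """Smallest k with P(Poisson(expected) <= k) >= coverage.

    Assumption: yearly failures follow a Poisson distribution.
    """
    k = 0
    term = math.exp(-expected)  # P(X = 0)
    cumulative = term
    while cumulative < coverage:
        k += 1
        term *= expected / k    # P(X = k) from P(X = k - 1)
        cumulative += term
    return k


# Figures taken from the quoted message; everything else is assumed.
disks_low = expected_failures(400, 0.01)    # 1% annual failure rate
disks_high = expected_failures(400, 0.02)   # 2% annual failure rate
psu_annual_rate = 2 / 768                   # "maybe 2 psu's of 768 nodes per year"

for label, expected in [("disks @1%", disks_low),
                        ("disks @2%", disks_high),
                        ("PSUs", expected_failures(768, psu_annual_rate))]:
    print(f"{label}: expect {expected:.1f}/yr, "
          f"stock {spares_needed(expected)} spares for 95% coverage")
```

A full year of spares is a conservative bound; if replacements can be reordered within weeks, the same calculation can be run with the expected failures over the reorder lead time instead.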
