[Beowulf] Cooling vs HW replacement
Robert G. Brown rgb at phy.duke.edu
Mon Jan 17 10:44:15 PST 2005
On Sun, 9 Jan 2005, Ariel Sabiguero wrote:

> Hello all.
> The following question shall only consider costs, not uptime or
> reliability of the solution.
> I need to balance costs of hardware replacement after failures over air
> conditioning costs.
> The question arises as most current hardware comes with 3 or more years
> of warranty. During that period of time Moore twofolded twice hardware
> performance... is it worth spending money cooling down a cluster or just
> rebuilding it after it "burns out" (and is at least 4 times slower than
> state-of-the art)?
> Is it worth cooling down the room to a Class A Computer room standard or
> save the money for hardware upgrade after three years? In warm countries
> keeping 18ºC the air inside a room (PC-heated) when outside temperature
> is 30ºC average it becomes pretty expensive to pay electricity bills. It
> is cheaper to "circulate" 30ºC air and have from 40-50ºC inside the chassis.

If you circulate 30C air, and have 50+C air inside the chassis, the CPU
and memory chips themselves will be at least 10 and more likely 30 or 50
C hotter than that. This will significantly reduce the lifetime of the
components. There is a rule of thumb that every 10 F (about 5.6 C)
hotter ambient air temp reduces expected lifetime by a year. You're
talking about running some 3x10 F degrees hotter than optimal for a 4+
year lifetime. This could easily reduce the MTBF for your nodes to 1-2
years.

However, this "lifetime" thing is going to be highly irregular. All
chips are not equal. Some subsystems, especially memory, will flake out
(give you odd answers, drop bits) if you habitually run them well above
desirable ambient. Some will run for four months, flake out, then break
at six months. Some will run for a year and pop. Some will make it to
two years, and only a relatively small fraction of your cluster will
make it to years 3-4.

It is therefore not possible to address "only the costs" without
addressing uptime and reliability.
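The rule of thumb above is easy to put into numbers. The following is only a minimal sketch of that linear rule (one year of expected lifetime lost per 10 F of extra ambient temperature); the function names and the floor at zero are assumptions made for illustration, and the rule itself is a rough heuristic, not a measured failure model.

```python
def delta_c_to_delta_f(delta_c):
    """Convert a temperature DIFFERENCE from Celsius degrees to Fahrenheit degrees."""
    return delta_c * 9.0 / 5.0


def expected_lifetime_years(baseline_years, excess_f, years_lost_per_10f=1.0):
    """Linear rule of thumb: lose `years_lost_per_10f` years per 10 F of excess ambient.

    The floor at zero is an assumption added here so the result never goes negative.
    """
    return max(0.0, baseline_years - years_lost_per_10f * excess_f / 10.0)


if __name__ == "__main__":
    # The post's own figures: a 4 year baseline, run 3x10 F hotter than optimal.
    print(expected_lifetime_years(4.0, 30.0))

    # Alternative reading: 30 C room air instead of 18 C, a 12 C excess.
    excess_f = delta_c_to_delta_f(30.0 - 18.0)
    print(expected_lifetime_years(4.0, excess_f))
```

Either reading of the numbers lands in the 1-2 year range mentioned above; as the next paragraph stresses, individual components will scatter widely around any such average.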
Downtime is expensive. Downtime due to a crash can cost you a week's
worth of work for the entire cluster for some kinds of problems.

Unreliable hardware is AWESOMELY expensive; I know from bitter, personal
experience. In addition to the associated downtime, there are all sorts
of human time associated with going into the cluster every week or two
to pull a downed node, work with it (sometimes for a full day) with
spares to identify the blown components, order and replace the blown
component, and get it back up. A minimum of say 4 hours per event, and
as much as 2-3 DAYS if something isn't broken but is just too flaky --
the system crashes (because of memory running too hot) but it reboots
fine when it is cooled and you can't identify a "bad chip" because there
isn't one, technically, except when it is under load AND being "cooled"
by hot air.

Time costs money -- generally more money than either the hardware OR the
air conditioning. Besides, as a running expense AC still costs only
about 1/3 of what it costs to power the nodes themselves (this depends
on the COP of your cooling system; I'm assuming a COP of 3-4). The rest
is infrastructure investment in building a properly cooled facility.
I'd say make the investment.

BTW, you might well find that hardware salespersons will balk at
replacing the equipment they sell you under extended service if you
don't maintain the recommended ambient air. So you might end up having
to pay for a constant stream of hardware out of pocket in addition to
the labor and downtime.

I just don't think it is worth it.

   rgb

>
> Do you have figures or graphs plotting MTBF vs temperature for main
> system components (memory, CPU, mainboard, HDD) ?
> Links to this information are highly appreciated!
> I remember old (40MB RLL disks shipping this information with the
> device, several pages of printed manual) hardware showing the
> difference in MTBF vs environment conditions, but nowadays commodity
> hardware does not consider this on the sticker on the top of the device...
>
> Regards
>
> Ariel
>
> PS: if the idea is worth the money, then I would like to study
> reliability and uptime, but it is not the main concern now.

-- 
Robert G. Brown                        http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
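The running-cost remark in the reply (AC at roughly 1/3 of node power for a COP of 3-4) follows from the definition of the coefficient of performance: heat removed divided by electrical power drawn by the cooling system. Below is a minimal sketch of that arithmetic, assuming essentially all electrical power drawn by the nodes ends up as heat in the room. The 3000 W load and the 0.10 per kWh price are made-up placeholders, not figures from the thread.

```python
HOURS_PER_YEAR = 24 * 365


def cooling_power_watts(node_power_watts, cop):
    """Electrical power the AC needs to remove `node_power_watts` of heat."""
    if cop <= 0:
        raise ValueError("COP must be positive")
    return node_power_watts / cop


def annual_cost(power_watts, price_per_kwh, hours=HOURS_PER_YEAR):
    """Yearly electricity cost of a constant load, in the currency of `price_per_kwh`."""
    return power_watts / 1000.0 * hours * price_per_kwh


if __name__ == "__main__":
    node_watts = 3000.0   # hypothetical cluster load
    price = 0.10          # hypothetical price per kWh
    for cop in (3.0, 4.0):
        ac_watts = cooling_power_watts(node_watts, cop)
        print(cop, ac_watts, annual_cost(ac_watts, price), annual_cost(node_watts, price))
```

With a COP of 3 the cooling draw is one third of the node draw, and with a COP of 4 it is one quarter, which is where the "about 1/3" estimate comes from.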
