[Beowulf] Cooling vs HW replacement
Luc Vereecken Luc.Vereecken at chem.kuleuven.ac.be
Tue Jan 18 05:41:36 PST 2005
Hi list,

I usually just lurk on this mailing list, but I think it is time to share some experience about not having cooling...

I have been running a cluster (of variable size depending on the season) in an average room without AC for several years. Not by choice, I must say: my request for AC was rejected, and it took years before the necessary infrastructure was present to move to another room that already had AC installed.

It is a horror story. During summer (yes, I'm on the sunny side of the building) temperatures were in excess of 35°C. During winter, even with the (small) window open while it was freezing outside, I couldn't get the temperature below 20°C. I just could not get rid of the generated heat, even though this is a chemistry building and the ventilation replaces the air 7 times per hour (or is it 15 times? I can't remember). Note that other rooms in this part of the building tend to be chilly in winter because they are so hard to heat with the ventilation taking out the heated air.

The first summer I had a failure rate of over 60%. Some motherboards failed, plenty of power supplies failed, and I had 10 brand-new disks that, at these ambient temperatures, ran so hot at times that I couldn't put my hand on them. 5 of them failed in the first 6 months, the other 5 a few months later. Some CPUs just stopped working. Some memory modules burned out. I have 2 or 3 nodes where I can reproducibly crash certain jobs or get faulty results just depending on the temperature of the room. I found that during the hot season, new computers ran for about 3 months, then started to go awry.

In an attempt to get rid of the hot air, I attached flexible air ducts to the exhaust of the power supplies (where most of the hot air comes out) so that the ventilation sucked the hot air out directly. This idea actually works pretty well for a DIY solution, especially as we have uber-ventilation given that this room used to be a chemical lab.
I might actually implement this for our new cluster too (in an AC-ed room), just to reduce the AC requirements. On average, it reduced the temperature in the room by several degrees, but I had to let go of the idea because I still had to fix nodes too often and handling the air ducts was a bit too cumbersome to do on a daily basis (some nodes are still attached to this system).

The next summer, I wised up and preemptively turned off some of the slower nodes. This time the failure rate of the remaining nodes was _only_ 45%, but I think this is partly because I just stopped fixing nodes at a certain point (ran out of spare power supplies )-: ) and left the faulty nodes turned off, waiting for the colder season.

After more than two years, I now have access to an AC-ed room. I plan on building a completely new cluster, as all the current hardware has been overheating and is prone to produce faulty results because of this, even though the room is cooler now than in summer.

My advice: don't even think about trying HW replacement instead of cooling.

- Failure rates are horrible at temperatures above 30°C ambient: I lost thousands of euros through failures.
- Downtime is also killing you: my scientific output has dropped to less than 30% of what it was before these heating problems started. With a bit of bad luck, I won't get a new grant because of this.
- TCO is horrible due to the operator time: you have to manually walk over to the cluster, take out the node, figure out the problem, fix it, get spare parts or contact the warranty supplier,... This takes too much time, especially given the high failure rates. I never worked as hard as during these last 2 years, and as I said, scientific output was still strongly reduced.
- You will have difficulty with warranty. Some components you can get replaced without questions as they show no obvious signs of abuse, but I had MoBos with several components blown and blackened. No way you can claim this is not caused by overheating. After you send in a couple of these, you start to get questions...
- The biggest killer of all is the non-visible problems. At certain moments I started to get different results on different nodes for the same job. A job would crash at 28°C, but run OK at 24°C. I get unexplainable outliers when calculating what should be a smooth trend. Rerunning exactly the same calculations gives different results. You just can't trust your results anymore.

OK. Better stop here, as I don't intend to rival rgb in length of a post :-)

Anyway: going the HW-replacement road would be, in my view and based on extensive experience, a wrong decision.

Luc Vereecken
