Heat pipes?

Robert G. Brown rgb at phy.duke.edu
Tue Jun 5 07:42:19 PDT 2001


On Mon, 4 Jun 2001, Patrick Lesher wrote:
> On Mon, 4 Jun 2001, Eddie Lehman wrote:
> > From: "Patrick Lesher" <pl25 at ibb.gatech.edu>
> > > On Mon, 4 Jun 2001 alvin at Mail.Linux-Consulting.com wrote:
> > > > On Mon, 21 May 2001, Donald B. Kinghorn wrote:

One thing to remember in this whole discussion is that beowulfs and most
sensibly designed clusters are supposed to be COTS entities.
Rackmounting nodes is ok, as racks and cases are COTS components.
Standard (mid, mini, whatever) towers and shelves are definitely ok and
were the original beowulf paradigm, suitable when the per-node cost of
the extra space they take up isn't greater than the per-node cost of
racks.

However, when you start talking about handmade immersion cooling
systems, custom chilled-water-cooled nodes, or handmade liquid-cooled
heatsinks, you're pushing the COTS boundary, unless you can find a
source of units that can just be bought and attached in place of a
regular heat sink/fan.  You're arguably still within the beowulf
paradigm at the very high end (where you are emulating the design
features of the various very expensive big-iron boxes of years past
while still using COTS components as the primary basis of design), BUT
you're also in a very narrow region of cost-benefit sanity.

Consider the cost-benefit.  The more handiwork you (or anybody else,
e.g. the vendor) has to put into node assembly, the higher the cost.  A
fair estimate of the retail value of that time is $100/hour in the
highly skilled labor marketplace, although of course you are welcome to
discount your own/local opportunity-cost labor to any value you choose.
If the use of custom heat sinks costs as little as one hour of work per
node (and the liquid cooling systems sound like they'd almost certainly
cost more than one hour of labor each unless/until they can be purchased
as COTS components and just stuck on the CPUs during a standard assembly
process), then this adds $100 per CPU to your node cost in addition to
the cost of the custom cooling component itself.  If it is handmade (as
a number of the proposed solutions in this thread were), the labor cost
can double or quadruple to $200-400 per CPU, with another $50-100 on top
of that for the materials used to build the sinks and radiators.  Let us
conservatively assume an additional cost of $50 for the hardware and
$100 (plus "an hour" of somebody's time) for the labor per node.
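
To make the back-of-the-envelope arithmetic explicit, here is a trivial
sketch in Python; the rate, hours, and parts figures are just the rough
assumptions above, and are of course trivially adjustable:

    # Rough extra cost of a custom cooling retrofit, per node.
    # All figures are the ballpark assumptions from the discussion above.
    LABOR_RATE = 100.0     # $/hour, retail value of highly skilled labor
    HOURS_PER_NODE = 1.0   # optimistic; handmade rigs can run 2-4x this
    PARTS_PER_NODE = 50.0  # tubing, fittings, sink/radiator materials

    def extra_cost_per_node(hours=HOURS_PER_NODE, rate=LABOR_RATE,
                            parts=PARTS_PER_NODE):
        return hours * rate + parts

    print(extra_cost_per_node())         # 150.0 -> the conservative $150/node
    print(extra_cost_per_node(hours=4))  # 450.0 -> toward the handmade worst case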

Now, this methodology only makes sense at all for very high density,
very high node count clusters -- 256 "hot" CPUs packed into 128 1U
cases, for example (assuming that such a thing is physically possible
with some liquid cooled design) -- or for overclockers.  I'm religiously
opposed to overclocking in a production cluster and will say no more of
this.  If you aren't overclocking and are only assembling 16 CPUs in
8-16 nodes, there are far simpler and cheaper COTS solutions:
decent-sized tower cases with an extra fan, sitting on shelves in a
space with adequate air conditioning (which will save roughly $200/node
in node cost anyway), or 2U or even (if necessary) 4U cases in a
rackmount setup, also with an extra fan and/or a larger "normal"
heatsink in a properly air conditioned space.

Using a handmade chilled-water cooling scheme for 128 nodes at one
hour of extra labor per node (if not per CPU, which might double it)
would cost more than three work-weeks of labor.  Three weeks of opening
nodes, attaching the cooling components to the CPUs (carefully, since a
bad job will be very difficult to repair later), sweat-soldering copper
tubing and running it out the back of the nodes with suitable fittings,
wrapping it in insulation inside the nodes to prevent condensation from
forming and dripping onto your motherboard, and pressure testing the
tubing before actually racking up the nodes.  Just distributing the
chilled water to all the nodes is a fairly significant problem in
engineering and will be quite expensive.  It also will leave you with a
rackmounted setup where nodes are most definitely NOT going to be easy
to replace, repair, or maintain.
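
For scale, the same trivial arithmetic applied to the whole cluster
(the node count and hours per node are, again, just the assumptions
above):

    # Total assembly labor for a 128-node chilled-water retrofit,
    # assuming one extra hour per node (double it if it is per CPU).
    NODES = 128
    HOURS_PER_NODE = 1.0
    WORK_WEEK = 40.0  # hours in a standard work-week

    total_hours = NODES * HOURS_PER_NODE
    print(total_hours / WORK_WEEK)      # 3.2 -> a bit over three work-weeks
    print(2 * total_hours / WORK_WEEK)  # 6.4 -> if it turns out to be per CPU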

This >>might<< make sense for a really big "permanent" cluster at a
large government or corporate research center, although I suspect that
even for these clusters there are cheaper and more robust (and saner)
alternatives.  You've certainly gone a long way from the original
beowulf off-the-shelf concept, and you will pay for doing so -- node
repair and replacement and upgrading the cluster all carry heavy
cost-penalties beyond "just" the cost of the hardware and a bit of
labor on a per-node basis.  Your operation just doesn't scale as
well, and this sort of handiwork only makes sense for clusters that
will be in use for a long time (to justify the extra month or two it
takes to build them).

It also might make sense for a turnkey company to use a design like
this for such customers, pre-solving the relevant engineering questions
so that they can build the nodes with economy of scale and robustness
of design.  And of course it is always reasonable
for the "hobbyist" cluster even on a small scale -- if it gives you joy
to build liquid cooling systems for your computer cluster, then by all
means do so.  I personally would rather have the extra node or two a
heavily customized cooling system would cost and would rather save the
time, even as a hobbyist.  Sticking an extra fan into each node is
relatively cheap, and moving cooled air around to pump into the nodes'
chassis via that fan is standard technology.

Air may not be the most efficient cooling medium available, but it is
plentiful and cheap and in many cases (sorry;-) is good enough.

Ultimately, it pays to remember what you are really buying with all this
effort and money.  That is >>perhaps<< a year's relative speed advantage
for your cluster.  I have these lovely 18-month-old dual 400 MHz PIIIs
with heat sinks the size of a small toaster (occupied volume of 250-500
cc's) and extra cooling fans.  Then I have eight-month-old 800 MHz
Athlons with heat sinks that are pretty "normal" in size (occupied
volume of 50-100 cc's), and one-month-old 1.33 GHz Athlons ditto.  The
way Moore's Law works, one could literally spend 2-3 months engineering
a custom cooling system for the processor du jour in an immensely
packed configuration only to be confronted, three months into the use of
the cluster, with 30% faster processors that actually consume less
power and fit into the same standard cases with no problem.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu