[Beowulf] immersion

Michael DiDomenico mdidomenico4 at gmail.com
Sun Mar 24 18:38:12 UTC 2024

thanks, there's some good info in there.  just to be clear to others that
might chime in, i'm less interested in the immersion/dlc debate than in
getting updates from people that have sat on either side of the fence.
dlc's been around awhile and so has immersion, but what i can't get from
sales glossies is real-world maintenance over time.

being in the DoD space, i'm well aware of the HPE stuff, but they're also
what's making me look at other stuff.  i'm not real keen on 100+ kW racks,
there are many safety concerns with that much amperage in a single
cabinet.  not to mention all that custom hardware comes at a stiff cost and
in my opinion doesn't have a good ROI if you're not buying 100's of racks
worth of it.  but your space-constrained issue is definitely one i'm
familiar with.  our new space is smaller than i think we should build, but
we're also geographically constrained.
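
for a rough sense of the amperage involved, here's a minimal sketch using the standard balanced three-phase relation I = P / (sqrt(3) * V * PF).  the 415 V line-to-line feed, the 0.95 power factor and the 30 kW comparison rack are all assumptions for illustration, not numbers from any vendor; `line_current_amps` is a made-up helper name.

```python
import math


def line_current_amps(power_kw, line_voltage=415.0, power_factor=0.95):
    """Per-phase line current (A) for a balanced three-phase load.

    line_voltage is line-to-line volts; both defaults are assumed values.
    """
    return power_kw * 1000.0 / (math.sqrt(3) * line_voltage * power_factor)


# 30 kW is a hypothetical air-cooled rack; 100 and 200 kW are the
# per-rack figures mentioned in this thread.
for kw in (30, 100, 200):
    print(f"{kw:>4} kW rack -> {line_current_amps(kw):6.0f} A per phase")
```

current scales linearly with rack power at a fixed voltage, so the actual numbers depend entirely on how the cabinet is fed.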

the other info i'm seeking is futures.  DLC seems like a right-now solution
to ride the AI wave.  i'm curious if others think DLC might hit a power
limit sooner or later, like air cooling already has, given chips keep
climbing in watts.  and maybe it's not even a power limit per se, but DLC
is pretty complicated with all the piping/manifolds/connectors/CDU's.  does
there come a point where it's just not worth it unless it's a big custom
solution like the HPE stuff?

On Sun, Mar 24, 2024 at 1:46 PM Scott Atchley <e.scott.atchley at gmail.com>
wrote:

> On Sat, Mar 23, 2024 at 10:40 AM Michael DiDomenico <
> mdidomenico4 at gmail.com> wrote:
>> i'm curious to know
>> 1 how many servers per vat or U
>> 2 i saw a slide mention 1500w/sqft, can you break that number into kw per
>> vat?
>> 3 can you shed any light on the heat exchanger system? it looks like
>> there's just two pipes coming into the vat, is that chilled water or oil?
>> is there a CDU somewhere off camera?
>> 4 that power bar in the middle is that DUG custom?
>> 5 any stats on reliability?  like have you seen a decrease in the hw
>> failures?
>> are you selling the vats/tech as a product?  can i order one? :)
>> since cpu's are pushing 400w/chip, nvidia is teasing 1000w/chip coming in
>> the near future, and i'm working on building a new site, i'm keenly
>> interested in thoughts on DLC or immersion tech from anyone else too
> As with all things in life, everything has trade-offs.
> We have looked at immersion at ORNL and these are my thoughts:
> *Immersion*
>    - *Pros*
>       - Low Power Usage Effectiveness (PUE) - as low as 1.03. This means
>       that you spend only $0.03 on cooling for each $1.00 of power that
>       the system consumes. In contrast, air-cooled data centers can
>       range from 1.30 to 1.60 or higher.
>       - No special racks - can install white box servers and remove the
>       fans.
>       - No cooling loops - no fittings that can leak, get kinked, or
>       accidentally clamped off.
>       - No bio-growth issues
>    - *Cons*
>       - Low power density - take a vertical rack and lay it sideways. DLC
>       allows the same power density with the rack being vertical.
>       - Messy - depends on the fluid, but oil is common and cheap. Many
>       centers build a crane to hoist out servers and then let them drip dry for a
>       day before servicing.
>       - High Mean-Time-To-Repair (MTTR) - unless you have two cranes, you
>       cannot insert a new node until the old one has dripped dry and been removed
>       from the crane.
>       - Some solutions can be expensive and/or lead to part failures due
>       to residue build up on processor pins.
> *Direct Liquid Cooling (DLC)*
>    - *Pros*
>       - Low PUE compared to air-cooled. Depends on how much of the heat the
>       water captures. Summit uses hybrid DLC (water for CPUs and GPUs and air for DIMMs,
>       NICs, SSDs, and power supply) with ~22°C water. Summit's PUE can range from
>       1.03 to 1.10 depending on the time of year. Frontier, on the other hand, is
>       100% DLC (no fans in the compute racks) with 32°C water. Frontier's PUE can
>       range from 1.03 to 1.06 depending on the time of year. Both PUEs include
>       the pumps for the water towers and to move the water between the Central
>       Energy Plant and the data center.
>       - High power density - the HPE Cray EX 4000 "cabinet" can supply up
>       to 400 kW and is equivalent in space to two racks (i.e., 200 kW per
>       standard rack). If your data center is space constrained, this is a crucial
>       factor.
>       - No mess - DLC systems with deionized water (DI water) or propylene
>       glycol water (PGW) use dripless connectors.
>       - Low MTTR - remove a server and insert another if you have a spare.
>    - *Cons*
>       - Special racks - HPE cabinets are non-standard and require HPE
>       designed servers. This is changing. I saw many examples of ORv3 racks at
>       GTC that use the OCP standard with DLC manifolds.
>       - Cooling loops - Loops can leak at fittings or be kinked or crimped,
>       which restricts flow and causes overheating. Hybrid loops are simpler while
>       100% DLC loops are more complex (i.e., expensive). Servers tend to include
>       drip sensors to detect this, but we have found that the DIMMs are better
>       drip sensors (i.e., the drips hit them before finding the drip sensor). 😆
>       - Bio-growth
>          - DI water includes biocides and you have to manage it. We have
>          learned that no system can be bio-growth free (e.g., inserting a blade will
>          recontaminate the system). That said, Summit has never had any
>          biogrowth-induced overheating and Frontier has gone close to nine months
>          without overheating issues due to growth.
>          - PGW systems should be immune to any bio-growth but you lose
>          ~30% of the heat removal capacity compared to DI water. Depending on your
>          environment, you might be able to avoid trim water (i.e., mixing in chilled
>          water to reduce the temperature).
>       - Can be expensive to upgrade the facility (i.e., to install
>       evaporative coolers, piping, pumps, etc.).
> For ORNL, we are space constrained. For that alone, we prefer DLC over
> immersion.
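
To put the PUE figures quoted above in absolute terms, here is a minimal sketch. It uses the standard definition PUE = total facility power / IT power, so everything beyond 1.0 is non-IT overhead (mostly cooling). The 10 MW IT load is a hypothetical figure chosen for illustration, and `facility_overhead_kw` is an illustrative helper name, not from any tool.

```python
def facility_overhead_kw(it_load_kw, pue):
    """Non-IT power (kW) implied by a PUE at a given IT load.

    PUE = total facility power / IT power, so overhead = IT * (PUE - 1).
    """
    return it_load_kw * (pue - 1.0)


IT_LOAD_KW = 10_000  # hypothetical 10 MW system, for illustration only

# PUE values are the ones quoted in the message above.
for label, pue in [
    ("best case (immersion or DLC)", 1.03),
    ("Frontier, upper end", 1.06),
    ("Summit, upper end", 1.10),
    ("air-cooled, low end", 1.30),
    ("air-cooled, high end", 1.60),
]:
    overhead = facility_overhead_kw(IT_LOAD_KW, pue)
    print(f"{label:<30} PUE {pue:.2f} -> {overhead:7.0f} kW overhead")
```

The overhead scales linearly with IT load, so the gap between a PUE of 1.03 and 1.30 grows in direct proportion to system size.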