[Beowulf] immersion

Scott Atchley e.scott.atchley at gmail.com
Sun Mar 24 19:59:04 UTC 2024

On Sun, Mar 24, 2024 at 2:38 PM Michael DiDomenico <mdidomenico4 at gmail.com>
wrote:

> thanks, there's some good info in there.  just to be clear to others that
> might chime in, i'm less interested in the immersion/dlc debate than in
> getting updates from people that have sat on either side of the fence.
> dlc's been around awhile and so has immersion, but what i can't get from
> sales glossies is real world maintenance over time.
>
> being in the DoD space, i'm well aware of the HPE stuff, but they're also
> what's making me look at other stuff.  i'm not real keen on +100kw racks,
> there are many safety concerns with that much amperage in a single
> cabinet.  not to mention all that custom hardware comes at stiff cost and
> in my opinion doesn't have a good ROI if you're not buying 100's of racks
> worth of it.  but your space constrained issue is definitely one i'm
> familiar with.  our new space is smaller than i think we should build, but
> we're also geography constrained.
>
> the other info i'm seeking is futures, DLC seems like a right now solution
> to ride the AI wave.  i'm curious if others think DLC might hit a power
> limit sooner or later, like Air cooling already has, given chips keep
> climbing in watts.  and maybe it's not even a power limit per se, but DLC
> is pretty complicated with all the piping/manifolds/connectors/CDU's, does
> there come a point where it's just not worth it unless it's a big custom
> solution like the HPE stuff

The ORv3 rack design's maximum power is the number of power shelves times
the power per shelf. Reach out to me directly at <my first name> @ ornl.gov
and I can connect you with some vendors.
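
The shelf arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name, the shelf count, and the per-shelf rating below are made-up placeholders, not ORv3 specification values or vendor figures, and the sketch ignores any shelves held back for redundancy.

```python
def rack_max_power_kw(num_power_shelves: int, kw_per_shelf: float) -> float:
    """Maximum rack power = number of power shelves x power per shelf."""
    if num_power_shelves < 0 or kw_per_shelf < 0:
        raise ValueError("shelf count and per-shelf power must be non-negative")
    return num_power_shelves * kw_per_shelf


# Hypothetical inputs for illustration only; substitute the numbers
# from the vendor's data sheet for the rack you are evaluating.
example_kw = rack_max_power_kw(num_power_shelves=4, kw_per_shelf=18.0)
print(f"{example_kw:.0f} kW")  # prints "72 kW"
```

If some shelves are reserved as spares, pass only the count of active shelves.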

> On Sun, Mar 24, 2024 at 1:46 PM Scott Atchley <e.scott.atchley at gmail.com>
> wrote:
>> On Sat, Mar 23, 2024 at 10:40 AM Michael DiDomenico <
>> mdidomenico4 at gmail.com> wrote:
>>> i'm curious to know
>>> 1 how many servers per vat or U
>>> 2 i saw a slide mention 1500w/sqft, can you break that number into kw
>>> per vat?
>>> 3 can you shed any light on the heat exchanger system? it looks like
>>> there's just two pipes coming into the vat, is that chilled water or oil?
>>> is there a CDU somewhere off camera?
>>> 4 that power bar in the middle is that DUG custom?
>>> 5 any stats on reliability?  like have you seen a decrease in the hw
>>> failures?
>>> are you selling the vats/tech as a product?  can i order one? :)
>>> since cpu's are pushing 400w/chip, nvidia is teasing 1000w/chip coming in
>>> the near future, and i'm working on building a new site, i'm keenly
>>> interested in thoughts on DLC or immersion tech from anyone else too
>> As with all things in life, everything has trade-offs.
>> We have looked at immersion at ORNL and these are my thoughts:
>> *Immersion*
>>    - *Pros*
>>       - Low Power Usage Effectiveness (PUE) - as low as 1.03. This means
>>       that you spend only $0.03 on cooling for each $1.00 of power that the
>>       system consumes. In contrast, air-cooled data centers can
>>       range from 1.30 to 1.60 or higher.
>>       - No special racks - can install white box servers and remove the
>>       fans.
>>       - No cooling loops - no fittings that can leak, get kinked, or
>>       accidentally clamped off.
>>       - No bio-growth issues
>>    - *Cons*
>>       - Low power density - take a vertical rack and lay it sideways. DLC
>>       allows the same power density with the rack being vertical.
>>       - Messy - depends on the fluid, but oil is common and cheap. Many
>>       centers build a crane to hoist out servers and then let them drip dry for a
>>       day before servicing.
>>       - High Mean-Time-To-Repair (MTTR) - unless you have two cranes,
>>       you cannot insert a new node until the old one has dripped dry and been
>>       removed from the crane.
>>       - Some solutions can be expensive and/or lead to part failures due
>>       to residue build up on processor pins.
>> *Direct Liquid Cooling (DLC)*
>>    - *Pros*
>>       - Low PUE compared to air-cooled. Depends on how much heat the water
>>       captures. Summit uses hybrid DLC (water for CPUs and GPUs and air for DIMMs,
>>       NICs, SSDs, and power supply) with ~22°C water. Summit's PUE can range from
>>       1.03 to 1.10 depending on the time of year. Frontier, on the other hand, is
>>       100% DLC (no fans in the compute racks) with 32°C water. Frontier's PUE can
>>       range from 1.03 to 1.06 depending on the time of year. Both PUEs include
>>       the pumps for the water towers and to move the water between the Central
>>       Energy Plant and the data center.
>>       - High power density - the HPE Cray EX 4000 "cabinet" can supply
>>       up to 400 kW and is equivalent in space to two racks (i.e., 200 kW per
>>       standard rack). If your data center is space constrained, this is a crucial
>>       factor.
>>       - No mess - DLC with Deionized water (DI water) or with Propylene
>>       Glycol Water (PGW) systems use dripless connectors.
>>       - Low MTTR - remove a server and insert another if you have a
>>       spare.
>>    - *Cons*
>>       - Special racks - HPE cabinets are non-standard and require HPE
>>       designed servers. This is changing. I saw many examples of ORv3 racks at
>>       GTC that use the OCP standard with DLC manifolds.
>>       - Cooling loops - Loops can leak at fittings or be kinked or
>>       crimped, which restricts flow and causes overheating. Hybrid loops are simpler
>>       while 100% DLC loops are more complex (i.e., expensive). Servers tend to
>>       include drip sensors to detect this, but we have found that the DIMMs are
>>       better drip sensors (i.e., the drips hit them before finding the drip
>>       sensor). 😆
>>       - Bio-growth
>>          - DI water includes biocides and you have to manage it. We have
>>          learned that no system can be bio-growth free (e.g., inserting a blade will
>>          recontaminate the system). That said, Summit has never had any
>>          biogrowth-induced overheating and Frontier has gone close to nine months
>>          without overheating issues due to growth.
>>          - PGW systems should be immune to any bio-growth but you lose
>>          ~30% of the heat removal capacity compared to DI water. Depending on your
>>          environment, you might be able to avoid trim water (i.e., mixing in chilled
>>          water to reduce the temperature).
>>       - Can be expensive to upgrade the facility (i.e., to install
>>       evaporative coolers, piping, pumps, etc.).
>> For ORNL, we are space constrained. For that alone, we prefer DLC over
>> immersion.
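
To put the PUE figures quoted above side by side, here is a rough Python sketch that converts a PUE value into the overhead power it implies for a given IT load. The 1000 kW load is an arbitrary assumption chosen for round numbers, the function name is hypothetical, and the sketch treats all of the overhead (PUE minus 1) as one lump without separating cooling from other facility losses.

```python
def overhead_kw(it_power_kw: float, pue: float) -> float:
    """Overhead power implied by a PUE value: IT load x (PUE - 1)."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_power_kw * (pue - 1.0)


# Arbitrary IT load, assumed for illustration only.
it_load_kw = 1000.0

# PUE values taken from the message above.
cases = [
    ("immersion, best case", 1.03),
    ("Frontier, upper end", 1.06),
    ("Summit, upper end", 1.10),
    ("air-cooled, lower end", 1.30),
    ("air-cooled, upper end", 1.60),
]

for label, pue in cases:
    print(f"{label:22s} PUE {pue:.2f} -> {overhead_kw(it_load_kw, pue):6.1f} kW overhead")
```

At the same IT load, the gap between a PUE of 1.03 and 1.30 is roughly a factor of ten in overhead power, which is why the liquid-cooled options look so similar to each other on this metric and so different from air.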
>> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf