[Beowulf] [EXTERNAL] Power Cycling Question

Lux, Jim (US 7140) james.p.lux at jpl.nasa.gov
Sat Jul 17 00:38:01 UTC 2021

An interesting question.
Power-cycling reliability is probably not a big deal. Temperatures already change a lot between light load and heavy load, and I'd be surprised if a "server class" PC couldn't take one power cycle per day when the grungiest consumer unit can. It's not like you're cycling between -40C and 70C every hour, as in an automotive application.

Managing the chillers, though, might be a bigger problem.

And as Jörg points out, there's a fair amount of sophistication needed in setting your turn-on and turn-off thresholds.
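
To make the threshold point concrete, here is a minimal sketch of a per-node on/off decision with separate turn-on and turn-off thresholds plus a minimum dwell time, so nodes don't flap. The names (Thresholds, decide) and all the numbers are made up for illustration; this is not how Slurm or any particular scheduler does it.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    # Illustrative values only; every number here is an assumption.
    idle_before_off_s: float = 1800.0  # node must sit idle this long before power-off
    wait_before_on_s: float = 120.0    # a job must wait this long before power-on
    min_dwell_s: float = 600.0         # minimum time between state changes


def decide(powered_on: bool, idle_s: float, oldest_wait_s: float,
           since_change_s: float, t: Thresholds) -> str:
    """Return "off", "on", or "hold" for a single node."""
    if since_change_s < t.min_dwell_s:
        return "hold"  # changed state too recently; avoid flapping
    if powered_on:
        # Turn off only if idle long enough and no job is waiting.
        if idle_s >= t.idle_before_off_s and oldest_wait_s == 0:
            return "off"
    elif oldest_wait_s >= t.wait_before_on_s:
        # Turn on only once a queued job has waited past the threshold.
        return "on"
    return "hold"
```

The gap between the two thresholds, together with the dwell time, is what keeps a bursty queue from cycling the same node on and off repeatedly. Picking those numbers for a real workload is the hard part.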

The "spool RAM to disk" idea is sort of like checkpointing, and it can take surprisingly long, so there's another tradeoff there.
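
A back-of-the-envelope sketch of that tradeoff, assuming the spool time is simply RAM size divided by sustained disk bandwidth, and that the node draws full active power while spooling, restoring, and booting. The function names and every number in the example are hypothetical, not measurements.

```python
def spool_time_s(ram_gib: float, disk_gib_per_s: float) -> float:
    """Time to write (or read back) all of RAM at a sustained disk rate."""
    return ram_gib / disk_gib_per_s


def breakeven_gap_s(overhead_s: float, active_w: float,
                    idle_w: float, off_w: float) -> float:
    """Shortest idle gap for which powering off uses no more energy than idling.

    Idling through a gap G costs idle_w * G.
    Powering off costs active_w * overhead_s + off_w * (G - overhead_s).
    Setting the two equal and solving for G gives the expression below.
    """
    return overhead_s * (active_w - off_w) / (idle_w - off_w)


# Assumed example node: 512 GiB RAM, 2 GiB/s sustained disk, 180 s boot.
save_s = spool_time_s(512, 2.0)        # 256.0 seconds to spool out
restore_s = spool_time_s(512, 2.0)     # same again to read back
overhead_s = save_s + restore_s + 180  # total time spent not doing useful work

# Assumed power draw: 400 W active, 150 W idle, 10 W when "off".
gap_s = breakeven_gap_s(overhead_s, active_w=400, idle_w=150, off_w=10)
```

With these assumed figures the spool alone takes over four minutes each way, and idle gaps shorter than gap_s are not worth a power cycle on energy grounds alone.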

On 7/16/21, 12:35 PM, "Beowulf on behalf of Douglas Eadline" <beowulf-bounces at beowulf.org on behalf of deadline at eadline.org> wrote:

    Hi everyone:

    Reducing power use has become an important topic. One
    of the questions I always wondered about is
    why more clusters do not turn off unused nodes. Slurm
    has hooks to turn nodes off when not in use and
    turn them on when resources are needed.

    My understanding is that power cycling creates
    temperature cycling, that then leads to premature node
    failure. That makes sense, but has anyone ever studied or tested
    this?

    The only other reasons I can think of are that the delay
    in server boot time makes job starts slow, or that power
    surges become an issue.

    I'm curious about other ideas or experiences.



