[Beowulf] Power Cycling Question

Sat Jul 17 00:44:32 UTC 2021

One problem with suspend/sleep arises if you have services that depend on
persistent TCP connections. I'm not sure that GPFS (er, sorry, "Spectrum
Scale"), for instance, would consistently tolerate its daemon connections
being interrupted, even if the node in question wasn't actually doing any
I/O.

Years ago we tried engineering our own "green cluster" automation with Grid
Engine, where we would shut down idle nodes until they were needed, but
doing it independently of the resource manager was far too complicated for
us to maintain, especially since it was all cost and no benefit for us, with
our power and cooling charges being absorbed through a flat overhead rate.
This might be less of an issue for schedulers/resource managers that have
fewer requestable resources than GE, and for sites that are billed for
actual power/cooling used and can more easily justify the staff time to
manage the extra complexity.
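
For sites that do pay for actual power, the cost/benefit question can be
roughed out as a break-even idle time: powering a node off only saves energy
if it stays off long enough to repay the extra energy spent on shutdown and
boot. The Python sketch below is illustrative only; the function names and
all wattage and energy figures are made-up assumptions, not measurements
from any real cluster.

```python
def break_even_idle_seconds(idle_watts, off_watts, cycle_joules):
    """Idle gap length (seconds) beyond which powering off saves energy.

    Simplified model (an assumption, not a measured one):
      staying idle for t seconds costs  idle_watts * t
      power cycling costs               cycle_joules + off_watts * t
    where cycle_joules is the extra energy of one shutdown + boot.
    """
    saved_per_second = idle_watts - off_watts
    if saved_per_second <= 0:
        raise ValueError("an off node must draw less power than an idle one")
    return cycle_joules / saved_per_second


def energy_saved_joules(gap_seconds, idle_watts, off_watts, cycle_joules):
    """Net energy saved by powering off for one idle gap (negative = loss)."""
    return (idle_watts - off_watts) * gap_seconds - cycle_joules


# Hypothetical node: 100 W idle, 10 W when off (BMC still powered),
# and 54 kJ of extra energy per shutdown/boot cycle.
threshold = break_even_idle_seconds(100, 10, 54_000)
print(f"break-even idle gap: {threshold:.0f} s")
print(f"net saving for a 1 h gap: {energy_saved_joules(3600, 100, 10, 54_000):.0f} J")
```

With these assumed numbers the node has to stay off for ten minutes before
the power cycle pays for itself; the model ignores staff time, hardware
wear, and jobs delayed by boot time, which were the larger costs for us.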

On Sat, Jul 17, 2021 at 12:43:27AM +0100, Jörg Saßmannshausen wrote:
> Hi Doug,
> 
> interesting topic and quite apt when I look at the flooding in Germany,
> Belgium and the Netherlands.
> 
> I guess there are a number of reasons why people are not doing it. Discarding
> the usual "we've never done that" one, I guess the main problem is: when do
> you want to turn a node off? After 5 mins being idle? Maybe 10 mins? One hour?
> How often do you then need to boot nodes up again, and how much energy does
> that cost? From chatting to a few people who tried it in the past, it
> transpired that you do not save as much energy as you were hoping for.
> 
> However, one thing came to my mind: is it possible to simply suspend a node to
> disk and then let it sleep? That way, you wake the node up quicker and it
> probably needs less power while it is suspended. Think of laptops.
> 
> The other approach would simply be: we know that in, say, the summer there is
> less demand, so we simply turn X number of nodes off and might do some
> maintenance on them. So you are running the whole cluster for, say, 6 weeks
> with limited capacity. That might mean a few jobs are queuing, but it also
> gives us a window to do things. Once people are coming back, the maintenance
> is done and the cluster can run at full capacity again.
> 
> Just some (crazy?) ideas.
> 
> All the best
> 
> Jörg
> 
> On Friday, 16 July 2021, 20:35:11 BST, Douglas Eadline wrote:
> > Hi everyone:
> > 
> > Reducing power use has become an important topic. One
> > of the questions I have always wondered about is
> > why more clusters do not turn off unused nodes. Slurm
> > has hooks to turn nodes off when not in use and
> > turn them on when resources are needed.
> > 
> > My understanding is that power cycling creates
> > temperature cycling, which then leads to premature node
> > failure. That makes sense, but has anyone ever
> > studied/tested this?
> > 
> > The only other reasons I can think of are that the delay
> > in server boot time makes job starts slow, or that there
> > are power surge issues.
> > 
> > I'm curious about other ideas or experiences.
> > 
> > Thanks
> > 
> > --
> > Doug
> 

-- 
Skylar