[Beowulf] [External] Power Cycling Question

Mon Jul 19 16:12:24 UTC 2021

Doug,

I don't think thermal cycling is as big an issue as it used to be.
From what I've always been told/read, the biggest problem with thermal
cycling was "chip creep", where the expansion/contraction of a chip in a 
socket would cause the chip to eventually work itself loose enough to 
cause faulty connections. 20+ years ago, I remember looking at 
motherboards with chips inserted into sockets. On a modern motherboard, 
just about everything is soldered to the motherboard, except the CPU and 
DIMMs. The CPUs are usually locked securely into place so chip creep 
won't happen, and the DIMMs have a latching mechanism, although anyone 
who has ever reseated a DIMM to fix a problem knows that mechanism
isn't perfect.

As someone else has pointed out, components with moving parts, like
spinning disks, are at higher risk of failure. Here, too, that risk is
disappearing, as SSDs are becoming more common, with even NVMe drives 
available in servers.

I know that there is a direct relationship between system failure and
operating temperature, but I don't know if that applies to all
components, or just those with moving parts. Someone somewhere must
have done research on this. I know Google did research on hard drive 
failure that was pretty popular. I would imagine they would have 
researched this, too.

As an example, when I managed an IBM Blue Gene/P, I remember IBM touting 
that all the components on a node (which was only the size of a PCI 
card) were soldered to the board - nothing was plugged into a socket. 
This was to completely eliminate chip creep and increase reliability. 
Also, the BG/P would shut down nodes between jobs, just as you're asking
about here. If there was another job waiting in the queue for those 
nodes, the nodes would at least reboot between every job.

I do have to say that even though my BG/P was small for a Blue Gene, it 
still had 2048 nodes, and given that number of nodes, I had extremely 
few hardware problems at the node level, so there's something to be said
for that logic. I did, however, have to occasionally reseat a node into 
a node card, which is the same as reseating a DIMM or a PCI card in a 
regular server.

Prentice

On 7/16/21 3:35 PM, Douglas Eadline wrote:
> Hi everyone:
>
> Reducing power use has become an important topic. One
> of the questions I always wondered about is
> why more clusters do not turn off unused nodes. Slurm
> has hooks to turn nodes off when not in use and
> turn them on when resources are needed.
>
> My understanding is that power cycling creates
> temperature cycling, that then leads to premature node
> failure. That makes sense, but has anyone ever studied/tested
> this?
>
> The only other reasons I can think of are that the delay
> in server boot time makes job starts slow, or that there
> are power surge issues.
>
> I'm curious about other ideas or experiences.
>
> Thanks
>
> --
> Doug
>