[Beowulf] delayed savings time crashes

David Kewley kewley at gps.caltech.edu
Wed Apr 12 12:57:30 PDT 2006

On Wednesday 12 April 2006 11:34, David Mathog wrote:
> Hmm, now that we know the cause of it that might explain
> why all those that did reboot were plugged into just 2 surge
> suppressors, where the loss was 9/10 machines, whereas the
> other 2 surge suppressors lost 0/10 machines.  Each surge
> suppressor is on its own circuit which is 1/3rd of a 3 phase line.
> Maybe only one phase had the glitch and by good luck the
> two circuits which lost no machines were wired between the
> two good phases?

I do not know how this worked, but I saw something similar and even
stranger.  Our UPS feeds two PDUs, each responsible for about half the
computers.  On one PDU, all computers on phases 1 & 2 failed; on the other,
all computers on phases 1 & 3 failed.  On both PDUs, the computers on the
third, unaffected phase stayed up.  I have no idea how to explain this.

> > This info comes from the responsible EE at Caltech.  As for its
> > effects, believe me, I know about it the hard way, as it took down 2/3
> > of our compute nodes, 1/3 of our disk shelves, and 3/4 of our
> > fileservers.
> That's a lot of machines in your case.  Did any sustain permanent
> damage?

It was a voltage drop rather than a spike, which probably explains why we
had no hardware damage.  We did have quite a bit of filesystem corruption to
clean up, which left lost files and corrupted file data for a small subset
of user files.


