[Beowulf] delayed savings time crashes

David Kewley kewley at gps.caltech.edu
Wed Apr 12 12:57:30 PDT 2006


On Wednesday 12 April 2006 11:34, David Mathog wrote:
> Hmm, now that we know the cause of it that might explain
> why all those that did reboot were plugged into just 2 surge
> suppressors, where the loss was 9/10 machines, whereas the
> other 2 surge suppressors lost 0/10 machines.  Each surge
> suppressor is on its own circuit which is 1/3rd of a 3 phase line.
> Maybe only one phase had the glitch and by good luck the
> two circuits which lost no machines were wired between the
> two good phases?

I do not know how this worked either, but I saw something similar and even
stranger.  Our UPS feeds two PDUs, each responsible for about half the
computers.  On one PDU, all computers on phases 1 & 2 failed; on the other,
all computers on phases 1 & 3 failed.  On both PDUs, all computers on the
remaining, unaffected phase stayed up.  I have no idea how to explain this.

> > This info comes from the responsible EE at Caltech.  As for its
> > effects, believe me, I know about it the hard way, as it took down 2/3
> > of our compute nodes, 1/3 of our disk shelves, and 3/4 of our
> > fileservers.
>
> That's a lot of machines in your case.  Did any sustain permanent
> damage?

It was a voltage drop rather than a spike, which probably explains why we
had no hardware damage.  We just have quite a bit of filesystem corruption to
clean up, which leaves lost files & corrupted file data in a small subset of
user files.

David


