[Beowulf] delayed savings time crashes

Wed Apr 12 10:42:31 PDT 2006

David,

The reboots were due to a City of Pasadena power glitch at 9:17 that 
morning. :)  It was raining, and a 34kV city feeder line that runs between 
the generating plant at the entrance of the 110 and a substation at Del Mar 
& Los Robles faulted.  The responsible breaker took 13 cycles to break, 
during which time the single-phase voltage seen at Caltech dropped to about 
75V.

This info comes from the responsible EE at Caltech.  As for its effects, 
believe me, I know about it the hard way, as it took down 2/3 of our 
compute nodes, 1/3 of our disk shelves, and 3/4 of our fileservers.  Our 
UPS has been on bypass these past 6+ months as we wait for our UPS vendor 
to install a fix so that the UPS can handle the tendency of our computer 
power supplies' internal Power Factor Correction feedback circuitry to lock 
up & induce massive 12Hz oscillations on the room's power lines.

As for the time glitch, that is probably because Daylight Saving Time 
changes only take place on the "system" clock, and in a standard Red Hat 
system those changes only get synced to the hardware clock upon a clean 
shutdown.  So if your machine crashes after a DST change, then upon bootup 
syslogd gets its time from the hardware clock, which is still on the old 
offset and therefore an hour off.  The system clock is only corrected 
later in the bootup sequence, when ntpd starts.  The best solution is 
probably to set the hardware clock to UTC rather than local time.  UTC 
doesn't undergo step changes like most timezones in the U.S. do, so the 
compensation for DST happens dynamically in software, rather than 
requiring a hardware clock change.
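
To make the arithmetic concrete, here is a rough Python sketch of what I 
think happens at boot.  The function name, its parameters, and the fixed 
PST/PDT offsets are my own illustration (not anything Red Hat ships), and 
the crash time is just taken from the syslogd restart line in your log.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets used purely for illustration (hypothetical names).
PST = timezone(timedelta(hours=-8), "PST")  # offset before the DST change
PDT = timezone(timedelta(hours=-7), "PDT")  # offset after the DST change


def time_at_boot(true_utc, rtc_keeps_utc, zone_at_last_sync, zone_now):
    """Local time the system would report at boot, before ntpd steps in.

    true_utc          -- the real instant of the reboot (aware, UTC)
    rtc_keeps_utc     -- True if the hardware clock stores UTC
    zone_at_last_sync -- offset in force when the RTC was last written
    zone_now          -- offset the booting system applies to local time
    """
    if rtc_keeps_utc:
        # RTC digits are UTC; the DST offset is applied in software.
        rtc_digits = true_utc.replace(tzinfo=None)
        system_time = rtc_digits.replace(tzinfo=timezone.utc)
    else:
        # RTC digits are local wall-clock time under the stale offset,
        # because no clean shutdown rewrote them after the DST change.
        rtc_digits = true_utc.astimezone(zone_at_last_sync).replace(tzinfo=None)
        # At boot those same digits are taken to be current local time.
        system_time = rtc_digits.replace(tzinfo=zone_now)
    return system_time.astimezone(zone_now)


# The reboot actually happened around 09:24:33 PDT, i.e. 16:24:33 UTC.
reboot = datetime(2006, 4, 4, 16, 24, 33, tzinfo=timezone.utc)

local_rtc = time_at_boot(reboot, False, PST, PDT)
utc_rtc = time_at_boot(reboot, True, PST, PDT)

print(local_rtc.strftime("%H:%M:%S"))  # 08:24:33, one hour behind
print(utc_rtc.strftime("%H:%M:%S"))    # 09:24:33, correct
```

With the hardware clock on local time the first log line after the crash 
comes out an hour early, which matches the backwards jump you saw; with 
the hardware clock on UTC it does not.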

David

On Wednesday 12 April 2006 09:05, David Mathog wrote:
> This is an odd one.  I just realized that 9 of 20 nodes
> rebooted on Apr 4.  (Since they all rebooted successfully everything
> was working and there was no reason to think that this had
> taken place.)  This appears to be related to the daylight
> saving time change two days before.  The reason I think that is
> that the nodes that rebooted have /var/log/messages files like:
>
> Apr   4 08:01:00 nodename CROND ... /cron/hourly
> Apr   4 09:01:00 nodename CROND ... /cron/hourly
> Apr   4 08:24:33 nodename syslogd 1.4.1; restart
>
> Notice the time shift backwards between the last normal
> record and the first reboot record.
>
> As if it finally caught on that the clock had changed and that
> somehow triggered a reboot.  Unfortunately none of the log files
> contain a message that indicated exactly what it was that ordered
> the reboot.
>
> Unclear to me what piece of software could have triggered this.
> Presumably something that had its own clock stuck one hour off
> on the previous time standard and also has the ability to restart
> the system.  ntpd?  Ganglia? They were both running.
>
> Regards,
>
> David Mathog
> mathog at caltech.edu
> Manager, Sequence Analysis Facility, Biology Division, Caltech