[Beowulf] precise synchronization of system clocks
larry.stewart at sicortex.com
Mon Sep 29 15:00:04 PDT 2008
On Sep 29, 2008, at 4:10 PM, Prentice Bisbal wrote:
> In the previous thread I instigated about running services in cluster
> nodes, there was some mentioning of precisely synchronizing the system
> clocks and this issue is also mentioned in this paper:
> "The Case of Missing Supercomputer Performance: Achieving Optimal
> Performance on the 8,192 processor ASCI Q" (Petrini, Kerbisin and
> I've also read a few other papers on the topic, and it seems you need to
> sync the system clocks to ~1 us. On top of that, I imagine you also need
> to sync the activities of each system so they all stop to do the same
> system-level tasks at the same time.
> The papers I read all mentioned different OSes, or at least
> hardware. Can this level of synchronization be achieved in Linux on
> commodity hardware? I imagine NTP doesn't have the resolution needed
> for this, and Don Becker has some strong feelings against NTP.
The SiCortex systems I work on are not commodity, but they do run
Linux. All the node chips in the machine are frequency locked to the
same oscillator, so the core cycle counters (MIPS standard) advance at
the same rate, but because the cores are released from reset at
different times, they are not initially synchronized. We recently
added a global clock synchronization step to booting the system by
timestamping messages sent over an out-of-band channel of the
interconnect. After some futzing around, we're able to synchronize all
the cycle counters to within about 50 nanoseconds. The timer
interrupts then happen at the same counter values system wide, which
naturally synchronizes most of the daemons that wake up. I don't
think we've gone to the trouble of gang scheduling them as well, which
would also be a good idea.
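The details of the actual procedure aren't given here, so the following is only a rough sketch of the general idea. It assumes a conventional two-way timestamp exchange (the same arithmetic NTP uses) between two counters that already run at the same frequency. The function names and the sample numbers are hypothetical.

```python
def estimate_offset(t0, t1, t2, t3):
    """Estimate (remote - local) counter offset from one message exchange.

    t0: request sent     (local counter)
    t1: request received (remote counter)
    t2: reply sent       (remote counter)
    t3: reply received   (local counter)
    Assumes the two directions have roughly equal delay.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    round_trip = (t3 - t0) - (t2 - t1)
    return offset, round_trip


def best_offset(samples):
    """Return the offset from the exchange with the shortest round trip.

    The shortest round trip is the one least likely to have been
    delayed asymmetrically, so its offset estimate is trusted most.
    """
    estimates = [estimate_offset(*s) for s in samples]
    return min(estimates, key=lambda e: e[1])[0]


if __name__ == "__main__":
    # Hypothetical counter values, in cycles.
    exchanges = [
        (0, 1050, 1060, 110),     # symmetric delays
        (200, 1250, 1260, 410),   # reply delayed on the way back
    ]
    print("estimated offset:", best_offset(exchanges))
```

Repeating the exchange many times and keeping the minimum round trip is a common way to filter out delayed messages; whether that matches what was done on this machine is an assumption.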
We tried reducing the timer interrupt rate from the standard 1000 Hz to 100 Hz, but
a bunch of stuff in the IP network stack reacted badly, slowing down
IP communications. We haven't tracked it all down yet.
As one would expect from the papers you cite, the clock
synchronization has had a very dramatic effect on large scale
collectives - a 5800 rank 8-byte allreduce is now down to 36
microseconds, where it was something like 170 microseconds before the
synchronization was added.
Since clusters built from commodity servers run on independent
oscillators, it is much harder to synchronize them - NTP will do a
very good job estimating the relative frequencies, but all those
oscillators will drift independently with temperature and aging, so
you have to run NTP continually.
However, the problem to solve, synchronizing local clocks with each
other, is different from the one NTP is intended to solve. You don't
really care what the wall clock time is, you only care that all the
systems have the same time.
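To put a number on the drift argument, here is a back-of-the-envelope sketch. The only fact it relies on is that 1 ppm of relative frequency error accumulates 1 microsecond of offset per second. The ppm figures are illustrative assumptions, not measurements of any particular hardware.

```python
def seconds_until_drift(budget_us, freq_error_ppm):
    """Time for two clocks to drift apart by budget_us microseconds.

    freq_error_ppm is the residual relative frequency error between
    the two oscillators, in parts per million. 1 ppm = 1 us per second.
    """
    return budget_us / freq_error_ppm


if __name__ == "__main__":
    # Hypothetical residual errors after frequency correction.
    for ppm in (10.0, 1.0, 0.1):
        t = seconds_until_drift(1.0, ppm)
        print(f"{ppm:5.1f} ppm -> 1 us of drift after {t:.1f} s")
```

Even a small residual error uses up a 1 microsecond budget within seconds, which is why the correction has to run continually.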
I've seen some other papers on the subject of using LAN timestamps to
provide much more accurate local synchronization. Here's one that
cites 10 microsecond results:
High-Precision Relative Clock Synchronization Using Time Stamp Counters
Guo-Song Tian; Yu-Chu Tian; Fidge, C.
Engineering of Complex Computer Systems, 2008. ICECCS 2008. 13th IEEE
International Conference on
March 31 - April 3, 2008, pages 69-78
Incidentally, a good way to measure the effects of OS noise locally is
to write a program that reads the core cycle counter in a tight loop,
and keeps statistics on the intervals between successive samples. You
can find out how often and for how long your OS is going out to lunch.
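A minimal sketch of such a program follows. Python can't read the core cycle counter directly, so this uses time.perf_counter_ns() as a stand-in, and the interpreter adds noise of its own (garbage collection, for instance). A C version reading the hardware counter would give sharper results. The sample count and the threshold are arbitrary assumptions.

```python
import time


def measure_gaps(samples=200_000, threshold_ns=50_000):
    """Read the clock in a tight loop and keep statistics on the
    intervals between successive reads.

    Gaps longer than threshold_ns are recorded individually; they are
    the candidates for "the OS went off and did something else".
    """
    now = time.perf_counter_ns
    min_gap = None
    max_gap = 0
    total = 0
    long_gaps = []

    prev = now()
    for _ in range(samples):
        cur = now()
        gap = cur - prev
        prev = cur
        total += gap
        if min_gap is None or gap < min_gap:
            min_gap = gap
        if gap > max_gap:
            max_gap = gap
        if gap > threshold_ns:
            long_gaps.append(gap)

    return {
        "samples": samples,
        "min_ns": min_gap,
        "mean_ns": total / samples,
        "max_ns": max_gap,
        "long_count": len(long_gaps),
        "long_gaps_ns": long_gaps,
    }


if __name__ == "__main__":
    stats = measure_gaps()
    print("min gap  (ns):", stats["min_ns"])
    print("mean gap (ns):", stats["mean_ns"])
    print("max gap  (ns):", stats["max_ns"])
    print("gaps over threshold:", stats["long_count"])
```

The minimum gap approximates the cost of the loop itself. The count and size of the long gaps show how often, and for how long, the process was interrupted.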