Cluster-wide overclocking...

Robert G. Brown rgb@phy.duke.edu
Thu, 24 Sep 1998 10:09:57 -0400


On Wed, 23 Sep 1998, Shachar Tal wrote:

> However, sanity checks in this particular case show that 1 in approx. 1.2
> billion calculations fail, and then computing it again covers up for the
> glitch, and we still benefit from overclocking.

Hmmm, on a beowulf with (say) 10 CPUs executing (say) only 100 million
floating point calculations per second each, that means that you have
roughly one failure a second -- and you still benefit?  Calculating every
number twice seems like it might double the time required to complete a
calculation, and anything less will not necessarily reveal the problem.
Then, how can you actually tell that the problem exists?  My beowulf
performs on the close order of (that is, within a factor of ten or so)

100x10^6 x 50 x 86400 = 4x10^14 

floats a day.  At one error per 10^9, that is around 4x10^5 errors/day.  I
cannot even think of a sanity check that would work in this case in
Monte Carlo code (my particular problem) -- random errors at this rate
undoubtedly have a distribution and undoubtedly the distribution
satisfies the Central Limit Theorem, so two daylong runs will both
produce the same -- wrong -- answer.  This would be true even if the
runs were only ten or twenty minutes long.
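
The point about two runs agreeing on the same wrong answer can be
illustrated with a toy Monte Carlo run.  This is only a sketch under
loud assumptions: the fault model (a result silently coming out too
large by 1.0), the function name monte_carlo_mean, the sample count and
the seeds are all hypothetical, and the error probability is hugely
exaggerated compared to 1 in 10^9 so that the effect shows up in a run
that takes a second rather than a day.

```python
import random

def monte_carlo_mean(n_samples, error_prob, seed):
    """Estimate the mean of uniform(0, 1) draws.

    Each draw is silently corrupted with probability error_prob
    (hypothetical fault model, not a measured one).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.random()
        if rng.random() < error_prob:
            x += 1.0  # assumed fault: the value comes out too large by 1.0
        total += x
    return total / n_samples

TRUE_MEAN = 0.5

# Two independent "faulty" runs and one clean run for comparison.
run_a = monte_carlo_mean(200_000, 0.01, seed=1)
run_b = monte_carlo_mean(200_000, 0.01, seed=2)
clean = monte_carlo_mean(200_000, 0.0, seed=3)

print("faulty run A:", run_a)
print("faulty run B:", run_b)
print("clean run   :", clean)
print("true mean   :", TRUE_MEAN)
```

Under these assumptions the two faulty runs should agree with each other
to within their statistical noise while both sitting near 0.51 instead
of 0.5, so comparing one run against the other says nothing about the
bias.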

Would you trust a calculation performed with faulty memory?  Would you
trust a calculation on a beowulf interconnected with a modem on noisy
phone lines?  Sure, you have parity, or ECC, or CRC tests, but they all
fail with some probability that is WAY TOO HIGH when you are looking at
10^14 (that's 100 TRILLION) FLOPS a day (I know, whatever a FLOP or an
IP is:-).  I worry about the failure rate on HEALTHY NON OVERCLOCKED
CPUs.  If the CPU itself has a floating point bug (as several Intel
processors in the past have had;-) or has even a very infrequent error
rate, current CPU speeds make it increasingly likely that the error will
show up in real time. The same problem exists with many random number
generators -- an algorithm whose period was effectively "infinite" ten
years ago can cycle once or twice a day today.
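
A rough way to see the random number generator problem is to divide the
period by the rate at which numbers are consumed.  The sketch below
assumes a period of 2^31, which is typical of older 32-bit linear
congruential generators; the draw rates and the helper names
cycles_per_day and seconds_to_exhaust are made up for illustration and
do not describe any particular code.

```python
SECONDS_PER_DAY = 86400

def cycles_per_day(period, draws_per_second):
    """How many times a generator with this period wraps around in a day."""
    return draws_per_second * SECONDS_PER_DAY / period

def seconds_to_exhaust(period, draws_per_second):
    """Seconds until the generator starts repeating its sequence."""
    return period / draws_per_second

PERIOD = 2**31  # assumed period of an older 32-bit generator

# Hypothetical consumption rates, in random numbers per second.
for rate in (1_000, 50_000, 100_000_000):
    print(rate, "draws/s:",
          cycles_per_day(PERIOD, rate), "cycles/day,",
          seconds_to_exhaust(PERIOD, rate), "s to exhaust")
```

With these assumed numbers, 50,000 draws a second already wraps the
generator about twice a day, and 10^8 draws a second exhausts it in
well under a minute.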

I'd consider 1 error/10^9 cycles just plain "broken" on a chip with a
~10^9 Hz clock...
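
The back-of-the-envelope rates used in this message can be checked with
a few lines of arithmetic.  The CPU counts, per-CPU speeds and error
rates are the ones quoted above; the function names are just labels
chosen for this sketch.

```python
SECONDS_PER_DAY = 86400

def ops_per_day(ops_per_second_per_cpu, n_cpus):
    """Total operations a cluster performs in one day."""
    return ops_per_second_per_cpu * n_cpus * SECONDS_PER_DAY

def errors_per_day(ops_per_second_per_cpu, n_cpus, ops_per_error):
    """Expected errors per day at one error every ops_per_error operations."""
    return ops_per_day(ops_per_second_per_cpu, n_cpus) / ops_per_error

def seconds_per_error(ops_per_second_per_cpu, n_cpus, ops_per_error):
    """Mean time between errors, in seconds."""
    return ops_per_error / (ops_per_second_per_cpu * n_cpus)

# 10 CPUs at 100 million flops each, one error per 1.2 billion operations.
print(seconds_per_error(100_000_000, 10, 1_200_000_000))

# 50 CPUs at 100 million flops each: operations per day, and errors per
# day at one error per 10^9 operations.
print(ops_per_day(100_000_000, 50))
print(errors_per_day(100_000_000, 50, 10**9))

# A single chip with a ~10^9 Hz clock and one error per 10^9 cycles.
print(seconds_per_error(10**9, 1, 10**9))
```

The first case comes out at a little over one second between failures,
the second at about 4.3x10^14 operations and 4.3x10^5 errors a day, and
the last at one error every second.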

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu