Cluster-wide overclocking...

Robert G. Brown rgb@phy.duke.edu
Wed, 23 Sep 1998 18:01:05 -0400


On Wed, 23 Sep 1998, Stanley, Jeremy wrote:

> I've been reading a lot on hardware optimization lately.  In general,
> while a lot of small changes make little difference on a stand-alone
> unit, the same change on every node of a beowulf could add up quickly.
> I'm in the preliminary stages of overclocking the processors in all my
> nodes, and looks like I can achieve thousands of dollars of processing
  ...more deleted..

Well, overclocking is great when it works.  The problem is that when it
doesn't work, failure can be anything from catastrophic (which,
paradoxically, is good) to subtle.  A catastrophic failure means that you
sigh, revert to the safe clock, and get the right answers.  A subtle
failure can be nothing more than a somewhat more common bit error.  If
you are using a beowulf for numerical calculations, it may be that a bit
error or two is survivable.  On the other hand, it may cause day to turn
into night in a publication in a reputable journal or an engineering
decision, and cost you time, money, embarrassment, and even your job.

The other problem is that even "catastrophic" hardware failures due to
overclocking (e.g. occasional system crashes) can be very difficult to
distinguish from problems in software or the kernel.  So overclockers send
lots of mail to e.g. linux-smp wondering why their (fill in the blank)
motherboard with (fill in the blank) memory and other hardware isn't
stable with linux.  When they actually confess to overclocking, a lot of
list folks' standard response is:  Stop Overclocking, Reboot, and if the
problem persists THEN we'll worry about it.

I personally don't overclock -- I have enough headaches from 27 systems
in a mixed beowulf/NOW without adding to them, even statistically.  A
big beowulf or NOW is precisely where the odds of losing the bet (for
overclocking is basically a bet) start getting uncomfortably large,
however small you think that they are for a single system.  To be
concrete, if you assume that your chances are only 1/100 of having a
failure in a year with one system, they are close to 1/4 if you have 30
systems.  Then there is the reduced system life, etc. which may or may
not be a concern.  And remember, not every CPU will run at a higher clock
just because you keep it cooler.
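
For anyone who wants to check the arithmetic above, here is a minimal
sketch in Python.  It assumes (an assumption made for illustration) that
failures on different nodes are independent and that every node has the
same yearly failure probability.  The function name p_any_failure is
hypothetical.

```python
def p_any_failure(p_single: float, n_systems: int) -> float:
    """Probability that at least one of n_systems fails in a year.

    Assumes independent nodes with an identical per-node yearly
    failure probability p_single (an illustrative assumption).
    """
    # P(no node fails) = (1 - p)^n, so P(at least one fails) = 1 - (1 - p)^n
    return 1.0 - (1.0 - p_single) ** n_systems


if __name__ == "__main__":
    # 1/100 per system, 30 systems: roughly 0.26, i.e. close to 1/4
    print(f"{p_any_failure(0.01, 30):.2f}")
```

With these numbers the result is about 0.26, and it keeps climbing as
nodes are added, which is why the bet looks worse on a cluster than on a
single box.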

Having dissed a bit on overclocking, I must say that I really don't have
a problem with OTHER folks overclocking -- I wish you well (that is, I
hope that you get away with it;-).  I'd be very cautious about
recommending it as standard practice in beowulf design, however.

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb@phy.duke.edu