Cluster-wide overclocking...
Robert G. Brown
rgb@phy.duke.edu
Fri, 25 Sep 1998 14:14:07 -0400
On Thu, 24 Sep 1998, Kragen wrote:
> On Thu, 24 Sep 1998, Robert G. Brown wrote:
> > There are other cases where it would be a disaster. If you are trying
> > to certify the passwd file as being "unbreakable" for a client in a
> > million dollar banking system and some kid running crack turns around
> > and finds it on a NON overclocked system in ten seconds,
>
> Well, this is still a bad example, because no reputable security person
> will certify a passwd file as unbreakable. The bad guys always have
> better dictionaries than the good guys, because the bad guys
> packet-sniffed netcom for several months and they know what kinds of
> passwords people *really* choose.
Absolutely. However, I was making a point illustrated clearly in most
decent probability texts: that in decision theory one has to consider
false positives and false negatives, and that one has to include the
cost function. The typical examples given in the probability texts
consider e.g. a medical test with some rates of failure (both positive
and negative), and then go on to balance the (statistical) costs and
benefits administering the test when e.g. the outcome of a false
negative is a person dying of cancer "unnecessarily" vs. something
else. I thought that the medical analogies were less apropos than an
admittedly slightly stretched passwd cracking example. Which isn't
that stretched -- sysadmins regularly run crack on their own passwd
files, not to "certify" them, but to guard against the costs of a
successful crack. So scale the numbers a bit and pick a point where
the cost of false negatives isn't negligible. The point I was making
is quite clear.
> BTW, there was an article somewhere recently about an experimental
> machine in which all the components were defective. The machine was
> just massively redundant, and cheaper to build than a non-redundant
> machine of equivalent power. There was a pointer to it on slashdot.org
> in the last week or two, IIRC.
Sure -- this is the motivation behind CRC and the like as well.
Perfectly clean phones lines are expensive. Noisy phone lines are
cheap and plentiful, but linear transmission of data is easily
corrupted and can only be linearly tested -- at a higher but still not
that high -- rate of reliability by multiple transmissions which lower
the effective speed. One then has a pure information theory problem
-- how to add minimally to the data transmitted in order to a) avoid
retransmitting everything; b) flag errors in the data stream at a high
rate of probability; c) optimize the amount of data that has to be
retransmitted as a result of errors at some presumed rate to maximize
overall throughput.
So, you packetize the data, generate a flag that characterizes a
packet in a way that utilizes far less information than that contained
in the packet (e.g. a parity bit, a CRC or MD5 sum, etc.) and send the
error check along with the data. At the other end (where CPU is
presumed much faster than the data transmission) you recalculate the
error parameter from the transmitted data and compare. If the error
data doesn't match, you force a retransmit. You can then optimize
(even in real time) the size of the packets to the number of forced
retransmissions to maintain maximum throughput, and obtain transmission
speeds less than an expensive clean line but greater than the cheap
noisy line would otherwise support.
HOWEVER, you also introduce the bet. Obviously, if parity is your
only indicator, TWO bit errors in one packet can go undetected. If
CRC or MD5 or whatever is your indicator, you can raise the bar and
make it so that only N-bit errors can be missed for some N, but if it
weren't possible to miss errors for ANY error detecting methodology,
the error data would be informationally redundant with the original
message and you are back to transmitting twice as slowly as you need
to. Nowadays, of course, modems routinely handle most of this for you
and even do an additional step (or two) of information compression
because a lot of data streams are NOT informationally efficient.
However, even with compression and error detection/correction (and no,
I'm not addressing automatic error correction, which is possible with
some methodologies at less cost than a full retransmission) there is
some finite chance that one will get bad data over a noisy phone line
(however much smaller than it is with linear transmission) and this is
a bet that I have lost before on more than one occassion. Who here
hasn't downloaded a file over a modem and found that the image was
corrupted?
The same problem exists throughout the system, software and hardware.
In order to make a system robust at the bit level, lots of this is
built right into the CPU itself and into all sorts of intermediate
levels, e.g. parity memory, ECC memory. If one's hypothesis is that
bit errors increase in probability at some rate with system clock, one
can easily push the probability of having a really harmful result over
a threshold where otherwise it is very low. The kind of error that
would really annoy ME would be certain bit errors corrupting parts of
the kernel itself -- all I need is to trash files or a whole disk by
having inode information corrupted when the kernel is in the middle of
a write. The reason that Alan Cox is annoyed in general by
overclockers is that they don't realize that the kernel, as one of the
busiest tasks on many computers, is one of the FIRST places that
bitlevel corruption occurs and it really doesn't like it -- its error
tolerance threshold is very low because it DOESN'T include a ton of
error correction because it is expensive and that makes the kernel
inefficient. So random things die and overclockers assume that it is
a broken kernel and complain, annoying just about everybody.
> > Really, one can actually make this completely mathematical given any
> > expected rate of failure and a reasonable knowledge of the classes of
> > failure that can occur,
>
> That's the kicker, though. Expected rates of failure in computers tend
> to be badly unpredictable; classes of failure tend to be difficult to
> enumerate.
Totally agreed. I'm really playing games here -- proposing
oversimplified situations designed to illustrate a point. In fact,
this is really the point I'm trying to make: Overclocking is
dangerous, virtually nobody who overclocks has any idea how to
properly analyze the risks (because expected rates of failure are
badly unpredictable and the classes of failure difficult to
enumerate:-), and occassionally overclockers get badly burned as a
result. The non-denumerable risks increase on a beowulf where one
overclocks lots of systems instead of just one. Conclusion:
overclocking is really not a very good idea for professionals where
the cost of failure can be quite high (unless they work very hard at
making their systems robust against the most likely errors), however
amusing it is for hobbyists where the cost of failure is low:-).
> > Statistics is one where that is VERY DANGEROUS to make
> > assumptions based on naive frequency interpretations,
>
> Yeah. I thought about your Monte Carlo calculations; it seems to me
> that one or two errors could easily swamp the whole run, on many kinds
> of Monte Carlo calculations. If you're calculating an approximation of
> a value X lots of times and then averaging your approximations to get a
> better X, in particular. Suppose that X is really about a billion.
> One single-bit error in a floating-point exponent could result in
> estimating an X that's closer to 10^20 or so.
>
> Your original thought that occasional one-bit errors wouldn't matter
> seems like a naive assumption about frequency distributions. But maybe
> your Monte Carlo calculations are of a type that wouldn't be affected
> in quite this way.
*Sigh* You caught me. My only excuse is that I don't think that
single bit errors are likely modes of failure, as parity tends to be
checked and corrected in hardware. Or maybe I should argue that I
included that in my Bayesian analysis:-). If I got a result off by
10^20, I'd notice because it would no longer be "sane" and redo the
run. Or (more likely) the error would affect the kernel instead of my
job as the CPU ALU went crazy and my system would crash in mid run.
Of course, rerunning I might well get another insane answer (or crash)
in which case my smart course would be to say, "Gee, time to set the
system clock back to spec".;-)
But the errors might not be bit errors or anything so predictable in
its outcome. The real problem, as you note, is that we don't really
know all the classes of errors that can occur, or how to estimate
their impact. I could easily imagine a threshold in a CPU being
effectively lowered by temperature so that an inner loop in a
microcode counter stutters and a bit is shifted left by two instead of
one (for example). A factor of only two might easily leave my results
within a fuzzy rejection threshold -- they might be sane but wrong.
Or something else.
In a way, the delicacy of a computer system is an advantage. Most
mutations of code or data, just like most genetic mutations, are so
harmful that they are immediately "fatal". The problematic ones are
the ones on the margin -- bad but not so bad that they die. (We will
ignore, for now, the probability of evolving "good" computer programs
via favorable mutations:-) Raising the clock on the CPU is essentially
increasing the rate of overall mutations -- kind of like living in a
radiation bath. Sure, the body has repair mechanisms for genetic
damage, but they can be overwhelmed and you can get unlucky and a cell
can have a cancerous mutation that doesn't die and isn't caught by
defenses. Oops. The odds of this happening go up if you live near
Chernobyl. The odds of cancerous code mutation go up if you overclock
(and are non-zero in both cases even if you don't). Some people are
comfortable with this, I dunno, but I don't plan to seek a nuclear tan
on my summer vacation.
All this stuff is really interesting -- computer systems are real
"complex systems" with statistical as well as deterministic
components, and as they get fast enough to execute a quadrillion
instructions or ten a day (within a factor of ten of Avogadro's number
of instruction a day:-) statistical mechanics and fuzzy logic and
decision theory and information theory all will need to be utilized to
keep them (approximately) operationally deterministic. This will
lower the thresholds for negative sequelae of clock-speed tampering
(which are typically very non-linear in the first place) still
further.
When one builds a system running at 800 MHz, where the only way they
got it to run that fast was to INCLUDE all sorts of redundancy and
error correction in the system design, and you try to squeeze it up to
1000 MHz, you may well push it WAY past the point where the system is
robust, and the failures may be very subtle indeed (all the obvious
and catastrophic varieties being guarded against, and the subtle ones
may be very unlikely at the rated clock but inevitable at 1000MHz).
The probability of radiative crosstalk between VLSI current pathways,
for example, is a very nonlinear function of clock -- there can
literally be resonance phenomena and interference phenomena as one
starts pushing into the ever higher frequency regime. You have all
these itty-bitty oscillators, each of which delivers linear
(predictable) responses >>in a given frequency domain<< that get
increasingly nonlinear as you overdrive them.
Call me hopelessly conservative. I prefer not to overclock systems
doing calculations that I hope to publish. I mean, how do I write my
error analysis section to include a probability of non-deterministic
numerical failure (in addition to all the other sources of error)? I
have to be able to assume that when I multiply x and y, I get their
product and not something else if I plan to multiply and sum 10^15
times a day.
rgb