[Beowulf] Curious about ECC vs non-ECC in practice

Fri May 20 08:52:43 PDT 2011

On 5/20/11 8:35 AM, "Tony Travis" <a.travis at abdn.ac.uk> wrote:

>On 20/05/11 05:35, Joe Landman wrote:
>> Hi folks
>>
>>     Does anyone run a large-ish cluster without ECC ram?  Or with ECC
>> turned off at the motherboard level?  I am curious if there are numbers
>> of these, and what issues people encounter.  I have some of my own data
>> from smaller collections of systems, I am wondering about this for
>> larger systems.
>
>Hi, Joe.
>
>Apparently this is still a big issue for computers in space, using
>non-ECC RAM for solid-state storage on grounds of cost for imaging.
>They, apparently, use RAM background SoftECC 'scrubbers' like this:
>
>http://pdos.csail.mit.edu/papers/softecc:ddopson-meng/softecc_ddopson-meng.pdf
>
>

Yes, it's a big tradeoff in the space world. Not only does ECC require
extra memory, but the EDAC logic consumes power and typically slows down
the bus (i.e., you need an extra bus cycle to cover the EDAC logic
propagation delay).

There's also a practical detail that the upset rate might be low enough
that it is ok to just tolerate the upsets, because they'll get handled at
some higher level of the process.

For instance, if you have a RAM buffer in a communication link handling
the FEC coded bits, then there's not much difference between a bit flip in
RAM and a bit error on the comm link, so you might as well just let the
comm FEC code take care of the bit errors.

We tend to use a lot of checksum strategies.  Rather than using an EDAC
scheme that corrects errors, it's good enough to just know that an error
occurred, and retry. This is particularly effective on Flash memory, which
has transient read errors: read it again and it works ok.
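
The detect-and-retry idea above can be sketched in a few lines of Python.
This is only an illustration, not flight code: the names add_crc,
read_with_retry and make_flaky_reader are hypothetical, CRC-32 from the
standard zlib module stands in for whatever checksum a real system would
use, and the "flaky" reader just simulates a transient read error.

```python
import zlib


def add_crc(payload: bytes) -> bytes:
    """Return the payload with a 4-byte big-endian CRC-32 appended."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def crc_ok(record: bytes) -> bool:
    """True if the trailing 4 bytes match the CRC-32 of the rest."""
    if len(record) < 4:
        return False
    payload, stored = record[:-4], record[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def read_with_retry(read_fn, max_tries=3):
    """Call read_fn() until a record passes its CRC check.

    Returns (payload, number_of_tries). No correction is attempted:
    a bad read is simply thrown away and the read is repeated.
    """
    for attempt in range(1, max_tries + 1):
        record = read_fn()
        if crc_ok(record):
            return record[:-4], attempt
    raise IOError("checksum still bad after %d tries" % max_tries)


def make_flaky_reader(record: bytes, bad_reads: int = 1):
    """Hypothetical device: the first bad_reads reads have one bit flipped."""
    state = {"calls": 0}

    def read() -> bytes:
        state["calls"] += 1
        if state["calls"] <= bad_reads:
            corrupted = bytearray(record)
            corrupted[0] ^= 0x01  # simulate a transient single-bit error
            return bytes(corrupted)
        return record

    return read
```

For example, read_with_retry(make_flaky_reader(add_crc(b"frame"))) rejects
the first (corrupted) read and returns the payload on the second try.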

Another example is doing an FFT.  There are some strategies which allow
you to do a second fast computation that essentially provides a "check" on
the results of the FFT (e.g., with the usual unnormalized FFT, the "DC
term" should equal the sum of the input data, i.e., N times the mean).

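A minimal sketch of the DC-term check, assuming an unnormalized
forward transform (so bin 0 equals the sum of the inputs).  The small
radix-2 fft here is only a stand-in so the example is self-contained, and
dc_check and its tolerance are hypothetical names and values.  Note that
this particular check only catches errors that disturb bin 0; it says
nothing about the other bins.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dc_check(x, X, rel_tol=1e-9):
    """True if X[0] agrees with sum(x), i.e. N times the input mean.

    The tolerance is scaled by the sum of input magnitudes so that
    ordinary floating-point rounding does not trigger a false alarm.
    """
    expected = sum(x)
    scale = max(1.0, sum(abs(v) for v in x))
    return abs(X[0] - expected) <= rel_tol * scale
```

The check costs one pass over the input (N additions), which is cheap next
to the N log N work of the transform itself.
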
We might also keep triple copies of key variables.  You read all three
values and compare them before starting the computation.  Software Triple
Redundancy, as it were.  A lot of times, the probability of an error
occurring "during" the computation is quite low compared to the
probability of an error occurring during the very long waiting time
between operations on the data.
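
The triple-copy scheme can be sketched as a majority vote on read.  This is
a toy illustration under stated assumptions: vote and TripleVar are
hypothetical names, and in a real system the three copies would sit in
physically separate memory rather than in one Python list.  Rewriting all
three copies after a successful vote is an added "scrub" step, not
something described above.

```python
def vote(a, b, c):
    """Return the value held by at least two of the three copies.

    Raises ValueError if all three copies disagree.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise ValueError("all three copies disagree")


class TripleVar:
    """Keep three copies of a key variable and vote on every read."""

    def __init__(self, value):
        self.copies = [value, value, value]

    def write(self, value):
        self.copies = [value, value, value]

    def read(self):
        value = vote(*self.copies)
        # Scrub: rewrite all copies so a single upset does not linger
        # until a second one makes the vote fail.
        self.copies = [value, value, value]
        return value
```

For example, after v = TripleVar(42), flipping a bit in v.copies[1] still
gives v.read() == 42, and the damaged copy is repaired by the read.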

There's also the whole question of whether EDAC main memory buys you much,
when all the (ever larger) cache isn't protected.  Again, it comes down to
a probability analysis.

My own personal theory on this is that you are much more likely to have a
miscalculation due to a software bug than due to an upset.  Further, it's
impossible to get all the bugs out in finite time/money, so you might as
well design your whole system to be fault tolerant, not in an "oh my gosh,
we had an error, let's do extensive fault recovery" way, but in a "we assume
the computations are always a bit wonky, so we factor that into our design"
way.  That is, design so that retries and self checks are just part of the
overhead.  Kind of like how a decent experiment or engineering design
takes into account measurement uncertainty stack-up.

As hardware gets smaller and faster and lower power, the "cost" to provide
extra computational resources to implement a strategy like this gets
smaller, relative to the ever increasing human labor cost to try and make
it perfect.

(And, of course, this *is* how humans actually do stuff. You don't
precompute all of your control inputs to the car. You basically set a
general goal, and continuously adjust to drive towards that goal.)

Jim Lux