[Beowulf] cheap PCs this christmas

Mark Hahn hahn at physics.mcmaster.ca
Wed Nov 23 07:02:29 PST 2005


> I work with systems for which, literally, "failure is not an 
> option"  (actually, we call that criticality=1.. loss of life or 

so what probability value (FIT, MTBF, etc.) do you assign to this?
"cannot fail" is not an option for this reality, of course...
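
for anyone who wants to put numbers on that: FIT is failures per 10^9
device-hours, so MTBF in hours is 10^9 / FIT. below is a minimal sketch,
assuming a constant failure rate (exponential model) and independent units.
the function names and the FIT value of 1000 are made up for illustration,
not measured data.

```python
import math

HOURS_PER_FIT_UNIT = 1e9  # FIT = failures per 10^9 device-hours


def fit_to_mtbf_hours(fit):
    """Convert a FIT rate to MTBF in hours."""
    return HOURS_PER_FIT_UNIT / fit


def failure_probability(fit, hours, units=1):
    """Probability of at least one failure among `units` parts in `hours`.

    Assumes a constant failure rate and independent units; real parts
    also show infant mortality and wear-out, which this ignores.
    """
    rate_per_hour = fit * units / HOURS_PER_FIT_UNIT
    return -math.expm1(-rate_per_hour * hours)


if __name__ == "__main__":
    fit = 1000.0  # hypothetical per-unit FIT, for illustration only
    print("MTBF (hours):", fit_to_mtbf_hours(fit))
    print("P(any failure, 100 units, 1 year):",
          failure_probability(fit, hours=8760, units=100))
```

the point is only that the probability is small or large, never zero.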

> But how many of those corruptions would have resulted in an error had they 
> not been caught?

unclear, but something I've always wondered about.  it's easy to imagine 
bit-flips that would be made permanent (files being written out, etc).
but it's also easy to imagine flips that would have no effect (a flip
in a part of the data you've already processed, for instance).  as well
as flips which would fail nicely (segv, etc).
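
a toy illustration of the first two cases, as a sketch only: a running sum
over a small buffer, with a single bit flipped partway through. the helper
names and the data are hypothetical. a flip in an element that has already
been consumed leaves the sum intact (but would still become permanent if the
buffer were written out afterwards), while a flip in an element not yet read
corrupts the result.

```python
def flip_bit(value, bit):
    """Return `value` with one bit inverted."""
    return value ^ (1 << bit)


def running_sum_with_flip(data, flip_index, bit, flip_at_step):
    """Sum `data` left to right; just before reading element
    `flip_at_step`, flip one bit of element `flip_index`.

    Returns the sum and the final (possibly corrupted) buffer.
    """
    buf = list(data)
    total = 0
    for step in range(len(buf)):
        if step == flip_at_step:
            buf[flip_index] = flip_bit(buf[flip_index], bit)
        total += buf[step]
    return total, buf


data = [10, 20, 30, 40]

# flip hits element 0 after it was already summed: result unaffected,
# but the buffer itself now holds a wrong value
late_total, late_buf = running_sum_with_flip(data, 0, 3, flip_at_step=2)

# flip hits element 3 before it is read: result is silently wrong
early_total, _ = running_sum_with_flip(data, 3, 3, flip_at_step=2)

print(sum(data), late_total, late_buf, early_total)
```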

> Is the cache in your processor ECC?  What's the impact on your performance 
> of cache hit/miss vis a vis ECC and/or bit flips.

yes, commodity processors do ECC on cache and datapaths these days.
it's unclear how many applications could notice the extra cycle of 
DRAM latency due to ECC - something pointer-oriented like GCC probably
would show a small effect.  lots of codes are pretty cache-friendly, though,
or else bandwidth-intensive enough not to notice a cycle of latency.
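
to show what that extra check-and-correct step on the read path amounts to,
here is a toy Hamming(7,4) single-error-correcting code. this is a sketch
for illustration, not what any particular processor implements: real memory
ECC typically uses wider SECDED codes over 64-bit words, and does it in
hardware. the function names are my own.

```python
def encode(data_bits):
    """Encode 4 data bits into a 7-bit Hamming codeword.

    Layout (1-based positions): p1 p2 d1 p4 d2 d3 d4.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def decode(codeword):
    """Correct up to one flipped bit; return (data_bits, syndrome).

    A nonzero syndrome is the 1-based position of the corrected bit.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    stored = encode(word)
    stored[5] ^= 1  # simulate a single bit flip in storage
    recovered, syndrome = decode(stored)
    print(recovered == word, syndrome)
```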



