[Beowulf] Not quite Walmart, or, living without ECC?

Mark Hahn hahn at mcmaster.ca
Fri Nov 16 13:56:16 PST 2007


> I just asked the local NT goon, "do you use ECC for the servers?" and
> he answered, "you have to". What he considers a server-class mobo
> requires ECC.

whether you need ECC depends on many things.  first, how much memory
your machine has - my experience is that most generic servers (web, file,
mail, etc.) don't have much - maybe a few GB.  the chance of needing ECC
also depends on how "hard" you use the RAM (again, mundane servers are
pretty lightly utilized), as well as factors like altitude, RAM quality,
and the ever-popular "how important is your data".
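
to put rough numbers on it, here's a back-of-the-envelope estimate in
python.  the FIT rate below is purely an assumed ballpark - published
per-Mbit soft-error rates vary by orders of magnitude with process,
altitude, and so on:

    # toy soft-error estimate; FIT = failures per 1e9 device-hours.
    # FIT_PER_MBIT is an ASSUMED ballpark, not a measured figure.
    FIT_PER_MBIT = 100.0
    mem_gb = 4                   # a "mundane" server with a few GB
    hours = 24 * 365             # one year of uptime

    mbits = mem_gb * 1024 * 8    # GB -> Mbit
    flips = mbits * FIT_PER_MBIT * hours / 1e9
    print("expected bit flips per year: %.1f" % flips)

scale mem_gb up to cluster-node sizes and the argument for ECC makes
itself.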

for clusters, I would say that ECC is basically a necessity, unless all
the jobs can be run in a "checking" mode (i.e., perform a search or
optimization, then verify the results in case the hit was due to a bit flip).
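
the pattern I mean is just this (a minimal sketch - search() and
is_valid() are hypothetical stand-ins for whatever the application
actually computes and however it re-checks a result):

    # run-then-verify pattern for ECC-less nodes: a hit produced by
    # a bit flip should fail an independent re-verification.
    def checked_search(search, is_valid, retries=2):
        for attempt in range(1 + retries):
            hit = search()
            if is_valid(hit):
                return hit
        raise RuntimeError("no verifiable result - suspect node?")

this only pays off when verifying is much cheaper than searching, which
is true of most search/optimization workloads.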

that said, ECC events are not all that common.  I have a 768-node cluster
here, each node dual-socket Opteron with 8GB of PC3200 DDR.  I just checked all
nodes with mcelog, and 35 have reported corrected events over roughly
the last 20 days.  one may have hit an uncorrectable event (though in our
clusters, the corrected-ECC rate is not a good predictor of uncorrectable
events...)
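
for scale, that survey works out to the following (counting one event
per reporting node, so a lower bound):

    # lower-bound corrected-error rate implied by the mcelog survey:
    # 35 of 768 nodes reported >=1 corrected event in ~20 days
    nodes, reporting, days, gb_per_node = 768, 35, 20, 8
    node_days = nodes * days
    print("events/node-day: %.4f" % (reporting / float(node_days)))
    print("events/GB-day  : %.6f" % (reporting / float(node_days * gb_per_node)))

call it roughly one corrected event per node per year - rare, but across
768 nodes that's still an event or two somewhere every day.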

> and he added that the tendency is now to FB-DIMM (fully
> buffered, http://en.wikipedia.org/wiki/FBDIMM). This suggests to me
> that next year(s) commodity mobos will be ECC.

nah on both counts.  I don't think anyone would claim that FBD is tearing
up the market - you can reasonably argue that it was a stopgap to let Intel
increase the memory capacity of chipsets whose MCH (memory controller hub)
had inadequate fan-out.  FBD is not a dumb idea, just not necessarily
valuable enough to win.

 	- the extra AMB has been a heat problem in the past and, no matter
 	how improved, still adds cost and board space.

 	- the design trades latency for expandability.  FB-DIMMs were
 	specified at up to 8 DIMMs/channel and up to 6 channels.  that's
 	pretty huge capacity - afaik, 4 channels is the most anyone has
 	implemented, and then with only 2-4 DIMMs/channel, presumably to
 	avoid too severe a latency hit, since FB-DIMMs are daisy-chained
 	and even one of them is slower than an AMB-less DIMM... (see the
 	capacity sketch after this list.)

 	- FBD would be more attractive if DRAM chips themselves were
 	not increasing in capacity (like CPUs and disks - all area-based,
 	and thus following Moore's law).

 	- attaching memory directly to CPUs has the nice property of
 	scaling with server "size".  AMD led Intel to this realization ;)

 	- it's unclear to me whether more cores on-chip will lead to a push
 	for more memory capacity per system.  then again, I don't think the
 	world is crying out for 8-core chips, either.
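
to put numbers on the capacity point above (the 4GB/DIMM figure is an
assumption about then-current high-end parts, not from the spec):

    # FBD capacity: what the spec allows vs. what actually shipped.
    gb_per_dimm = 4              # ASSUMED size of a big FB-DIMM
    spec    = 6 * 8              # 6 channels x 8 DIMMs/channel
    shipped = 4 * 4              # ~4 channels x 2-4 DIMMs/channel
    print("spec ceiling : %d DIMMs = %d GB" % (spec, spec * gb_per_dimm))
    print("shipped max  : %d DIMMs = %d GB" % (shipped, shipped * gb_per_dimm))

so real systems give up most of the theoretical capacity just to keep
the daisy-chain short.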

I suspect that FBD will have only a little more market/history footprint
than RDRAM did.


