Dual-Athlon Cluster Problems

Joe Landman landman at scalableinformatics.com
Fri Jan 24 10:37:50 PST 2003


A step further:

  I have found that during transport, installation, etc., the memory
modules, sitting in their nice DIMM sockets, sometimes manage to wiggle
their way out of proper contact.  Reseating the DIMMs and the rest of
the components sometimes makes the problems go away.  The loosening is
connected with heat (thermal expansion) and other issues.  This seems to
have been the case with some of my Athlon systems.  Others just had bad
RAM.

  I typically run memtest for a few days (2 or so over a weekend) on
initial deployment/construction.  It catches bad RAM or motherboards
quite quickly, and it reduces your unknown problem space quite a bit,
which is useful in debugging dead/dying hardware.

Joe

On Thu, 2003-01-23 at 14:15, Erwan Velu wrote:
> On Thu, 2003-01-23 at 19:52, Martin Siegert wrote:
> > > 4/ Are there any other outstanding issues with these machines 
> > >    under constant heavy load ?
> > In 99% of all the crashes I have seen on my cluster (and I have seen
> > a lot) the reason was bad memory. If you did not buy memory certified by
> > the company that sold you the motherboard, exchange it and your problems
> > will go away.
> Agreed, you should try to boot each node using memtest86
> (http://www.memtest86.com/memtest86-3.0.tar.gz) which is written in
> assembly code and executed at boot time so it isn't linked with any
> operating system.
> This is the best test I know for being sure that the memory is good.
-- 
Joseph Landman, Ph.D
Scalable Informatics LLC,
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
phone: +1 734 612 4615



