[Beowulf] Strange hardware? problems

Sat Apr 28 05:13:40 PDT 2007

On Fri, 27 Apr 2007, Orion Poplawski wrote:

> I'm at a loss and trying to see if anyone else has had similar problems.
>
> We've got two pairs of identical machines:
>
> - 2 Tyan S2882 dual processor Opteron 244 stepping 10
> - 2 Tyan S2882-D dual processor dual core Opteron 275 stepping 2
>
> We have two (relatively complicated) numerical models (RAMS and a homegrown 
> one) that will blow up in random locations on the 244 machines but run fine 
> on the 275 machines.
>
> By blow up it appears the calculations get corrupted in some way and the 
> numbers get un-physical in RAMS and the simulation exits.  With the other 
> model we get segfaults.
>
> Memtest86 runs fine.  No other hardware issues that I can find.

Your problem sounds like the following (to me).  Your program starts up
and uses dynamic memory to allocate this or that according to some sort
of simulation or execution pathway.  Alas, it leaks, or else you haven't
correctly estimated the memory required for it to complete.  For
whatever reason you didn't configure swap on the systems (which would
typically slow the computation down to where its need for VM became
obvious), and so eventually it exhausts memory.  At that point (which
would be slightly different per run because it depends on just what the
kernel has resident and a lot of other factors) it either causes
resident kernel/system programs that try to swap in to fail at random,
causing a system crash or unpredictable behavior, or it causes a
segmentation violation in the program itself.

Two other possibilities are corrupt or incorrect libraries (depending on
what the computations use, there can always be bugs in some
little-tested pathway in some library) and, of course, bugs in the code.

It is (IMO) VERY VERY unlikely that you've found a CPU bug running
ordinary numerical code, especially two distinct versions of it.  They
MIGHT share a buggy library, though.  It is not at all unlikely that
there is a bug in your code, but not so likely that there is a similar
bug in both programs, unless the "bug" is that it uses too much memory
and does unpredictable things when it runs out.

Things to try to debug -- run a memory monitor -- e.g. vmstat, wulfstat,
on the cluster as the programs run.  Check their resident and virtual
memory occupation as the program proceeds.  If their memory footprint
steadily grows, you've got (and found) a problem.  You may see other
things that give you a clue as to what might be going on.  Configure
swap (if you haven't got any already) and see if it affects the problem.
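If vmstat or wulfstat output is awkward to keep around between runs, the
same check can be scripted.  What follows is only a minimal sketch,
assuming a Linux /proc filesystem; the function names (parse_status,
read_memory, steadily_growing, monitor) and the 1 MB growth threshold
are made up for illustration and are not part of any existing tool.  It
samples VmSize and VmRSS for one process and reports whether the
resident size only ever grew.

```python
import time


def parse_status(text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            fields[key] = int(rest.split()[0])  # the kernel reports these in kB
    return fields


def read_memory(pid):
    """Read the current memory figures for one process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status(f.read())


def steadily_growing(samples, min_growth_kb=1024):
    """True if the samples never shrink and grow by min_growth_kb overall.

    The threshold is an arbitrary illustrative value, not a known limit.
    """
    if len(samples) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    return non_decreasing and samples[-1] - samples[0] >= min_growth_kb


def monitor(pid, interval=10.0, count=30):
    """Print a memory sample every `interval` seconds; return the leak verdict."""
    rss = []
    for _ in range(count):
        mem = read_memory(pid)
        rss.append(mem.get("VmRSS", 0))
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} VmSize={mem.get('VmSize')} kB VmRSS={mem.get('VmRSS')} kB",
              flush=True)
        time.sleep(interval)
    return steadily_growing(rss)
```

Call monitor() with the PID of the running model and redirect the output
to a file.  A resident size that climbs right up until the crash points
at the memory explanation; a flat one suggests looking elsewhere.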

Finally, instrument the code in the version you compile from source and
dump a trace of sorts so you can "see" where it fails (old timey "real
debugging").  When you see where it is, try to figure out what's going
on by increasing the instrumentation.  This can result in big output
files to go through, but is your best bet in the long run.
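As a rough illustration of the kind of trace I mean, here is a sketch in
Python; a Fortran code like RAMS would do the same thing with WRITE
statements.  Everything in it is hypothetical: the function name
trace_field, the field name "theta" in the usage note, and the sanity
limit are placeholders you would replace with whatever is physically
sensible for your model.

```python
import math
import sys


def trace_field(step, name, values, limit=1.0e6, out=sys.stderr):
    """Log the range of one model field at one time step.

    Returns the index of the first non-finite or out-of-range value,
    or None if every value looks sane.  `limit` is an assumed bound.
    """
    bad = None
    lo, hi = math.inf, -math.inf
    for i, v in enumerate(values):
        if not math.isfinite(v) or abs(v) > limit:
            if bad is None:
                bad = i  # remember only the first offender
            continue
        lo = min(lo, v)
        hi = max(hi, v)
    if bad is None:
        status = "ok"
    else:
        status = f"BAD at index {bad} (value {values[bad]!r})"
    # Flush on every line so the trace survives a segfault or abort.
    print(f"step={step} field={name} min={lo:g} max={hi:g} {status}",
          file=out, flush=True)
    return bad
```

Calling something like trace_field(step, "theta", theta) once per time
step for each major field gives one line per field per step.  The last
"ok" line before the first "BAD" one brackets where the corruption
appears, and that is where to add finer-grained instrumentation.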

    rgb

>
> We've tried FC4/5 on the 244 machines.  At one point all were running 
> identical FC5 installs with the same problems.
>
> Problem is not exactly reproducible unfortunately.  It will crash at 
> different times in the simulations, but they will crash at some point with 
> the length of runs we are doing.
>
> Are there any cpu tests out there that would check the accuracy of various 
> calculations?
>
> Thanks!
>
>

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu