[Beowulf] An annoying MPI problem

Lombard, David N dnlombar at ichips.intel.com
Thu Jul 10 07:53:09 PDT 2008


On Tue, Jul 08, 2008 at 07:01:48PM -0700, Joe Landman wrote:
> Hi folks
> 
>    Dealing with an MPI problem that has me scratching my head.  Quite
> beowulfish, as that's where this code runs.
> 
>    Short version:  The code starts and runs.  Reads in its data.  Starts
> its iterations.  And then somewhere after this, it hangs.  But not
> always at the same place.  It doesn't write state data back out to the
> disk, just logs.  Rerunning it gets it to a different point, sometimes
> hanging sooner, sometimes later.  Seems to be the case on multiple
> different machines, with different OSes.  Working on comparing MPI
> distributions, and it hangs with IB as well as with shared memory and
> tcp sockets.
...
> I'll try all the usual things (reduce the optimization level, etc).
> Sage words of advice (and clue sticks) welcome.

Not trying to sound like an ad...

The currently shipping Intel Trace Collector and Analyzer (7.1) includes
message correctness checking.  An option is available that adds a
library to an Intel MPI build that checks messages during the run.
You can then view any errors it found in the Intel Trace Analyzer.
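
To give a rough idea of one kind of error a run-time checker can flag,
here is a toy sketch in Python.  It is not how the Intel tool works
internally; the function name and the input format (a map from each
blocked rank to the one rank it is waiting on) are made up for
illustration.  It just looks for a cycle of ranks that are all blocked
on each other, which is one common cause of a hang.

```python
def find_wait_cycle(waiting_on):
    """Return the ranks forming a wait-for cycle, or None if there is none.

    waiting_on is a hypothetical snapshot: rank -> the rank it is blocked on.
    Ranks that are not blocked simply do not appear as keys.
    """
    for start in waiting_on:
        path = []      # ranks visited from this start, in order
        seen = set()   # same ranks, for fast membership tests
        rank = start
        # Follow the chain of "blocked on" links until it ends or repeats.
        while rank in waiting_on and rank not in seen:
            seen.add(rank)
            path.append(rank)
            rank = waiting_on[rank]
        if rank in seen:
            # The chain came back to a visited rank: that closes a cycle.
            return path[path.index(rank):]
    return None


# Ranks 0 and 1 each sit in a blocking receive from the other.
deadlocked = find_wait_cycle({0: 1, 1: 0})
# Rank 0 waits on 1, rank 1 waits on 2, rank 2 is still making progress.
progressing = find_wait_cycle({0: 1, 1: 2})
```

With these inputs, deadlocked is [0, 1] and progressing is None.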

This may uncover a latent problem that has only just started to trip the
code up.  I certainly have welts from those; I suspect others do too.

-- 
David N. Lombard, Intel, Irvine, CA
I do not speak for Intel Corporation; all comments are strictly my own.


