Try disabling shared memory only.<br>Open MPI's shared memory buffer is limited, and it deadlocks if you overflow it.<br>Because Open MPI busy-waits, the deadlock looks like a livelock.<br><br><br><div class="gmail_quote">2008/7/9 Ashley Pittman <<a href="mailto:apittman@concurrent-thinking.com">apittman@concurrent-thinking.com</a>>:<br>

<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div class="Ih2E3d">On Tue, 2008-07-08 at 22:01 -0400, Joe Landman wrote:<br>

>    Short version:  The code starts and runs.  Reads in its data.  Starts<br>

> its iterations.  And then somewhere after this, it hangs.  But not<br>

> always at the same place.  It doesn't write state data back out to the<br>

> disk, just logs.  Rerunning it gets it to a different point, sometimes<br>

> hanging sooner, sometimes later.  Seems to be the case on multiple<br>

> different machines, with different OSes.  Working on comparing MPI<br>

> distributions, and it hangs with IB as well as with shared memory and<br>

> tcp sockets.<br>

<br>

</div>Sounds like you've found a bug, doesn't sound too difficult to find,<br>

comments in-line.<br>

<div class="Ih2E3d"><br>

>    Right now we are using OpenMPI 1.2.6, and this code does use<br>

> allreduce.  When it hangs, an strace of the master process shows lots of<br>

> polling:<br>

<br>

</div>Why do you mention allreduce, does it tend to be in allreduce when it<br>

hangs?  Is it happening at the same place but on a different iteration<br>

every time perhaps?  This is quite important, you could either have a<br>

"random" memory corruption which can cause the program to stop anywhere<br>

and is often hard to find, or a race condition which is easier to deal<br>

with, if there are any similarities in the stack then it tends to point<br>

to the latter.<br>

<br>

allreduce is one of the collective functions with an implicit barrier<br>

which means that *no* process can return from it until *all* processes<br>

have called it.  If your program uses allreduce extensively it's entirely<br>

possible that one process has stopped for whatever reason and the<br>

rest have continued as far as they can until they too deadlock.  Collectives<br>

often get accused of causing programs to hang when in reality N-1<br>

processes are in the collective call and 1 is off somewhere else.<br>
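<br>
A minimal sketch of that N-1 situation, using Python threads and threading.Barrier as a stand-in for the collective. This is an analogy only, not MPI code, and run_job, its parameters and the timeout are made up for illustration. One rank never enters the barrier, and every other rank is then found stuck inside it:<br>

```python
import threading


def run_job(n_workers, straggler, timeout=0.5):
    """Return the sorted ranks that got stuck waiting in the barrier.

    straggler is the rank that never reaches the barrier (None for no straggler).
    """
    barrier = threading.Barrier(n_workers)
    stuck = []
    lock = threading.Lock()

    def worker(rank):
        if rank == straggler:
            return  # this rank is "off somewhere else" and never joins
        try:
            # Stand-in for the collective call: returns only once all ranks arrive.
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            # A real MPI job would spin here forever; the timeout lets the sketch end.
            with lock:
                stuck.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stuck)


if __name__ == "__main__":
    print("stuck in the barrier:", run_job(4, straggler=2))
```

The ranks reported as stuck are the ones sitting in the barrier, yet the rank to investigate is the one missing from that list.<br>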

<div class="Ih2E3d"><br>

> c1-1:~ # strace -p 8548<br>

<br>

</div>> [spin forever]<br>

<br>

Any chance of a stack trace, preferably a parallel one?  I assume *all*<br>

processes in the job are in the R state?  Do you have a mechanism<br>

available to allow you to see the message queues?<br>

<div class="Ih2E3d"><br>

> So it looks like the process is waiting for the appropriate posting on<br>

> the internal scoreboard, and just hanging in a tight loop until this<br>

> actually happens.<br>

><br>

> But these hangs usually happen at the same place each time for a logic<br>

> error.<br>

<br>

</div>Like in allreduce you mean?<br>

<div class="Ih2E3d"><br>

> But the odd thing about this code is that it worked fine 12 - 18 months<br>

> ago, and we haven't touched it since (nor has it changed).  What has<br>

> changed is that we are now using OpenMPI 1.2.6.<br>

<br>

</div>The other important thing to know here is what you have changed *from*.<br>

<div class="Ih2E3d"><br>

> So the code hasn't changed, and the OS on which it runs hasn't changed,<br>

> but the MPI stack has.  Yeah, that's a clue.<br>

<br>

> Turning off openib and tcp doesn't make a great deal of impact.  This is<br>

> also a clue.<br>

<br>

</div>So it's likely algorithmic?  You could turn off shared memory as well<br>

but it won't make a great deal of impact so there isn't any point.<br>

<div class="Ih2E3d"><br>

> I am looking now to trying mvapich2 and seeing how that goes.  Using<br>

> Intel and gfortran compilers (Fortran/C mixed code).<br>

><br>

> Anyone see strange things like this with their MPI stacks?<br>

<br>

</div>All the time, it's not really strange, just what happens on large<br>

systems, especially when developing MPI or applications.<br>

<div class="Ih2E3d"><br>

> I'll try all the usual things (reduce the optimization level, etc).<br>

> Sage words of advice (and clue sticks) welcome.<br>

<br>

</div>Is it the application which hangs or a combination of the application<br>

and the dataset you give it?  What's the smallest process count and<br>

timescale you can reproduce this on?<br>

<br>

You could try valgrind which works well with openmpi, it will help you<br>

with memory corruption but won't be of much help if you have a race<br>

condition.  Going by reputation Marmot might be of some use, it'll point<br>

out if you are doing anything silly with MPI calls, there is enough<br>

flexibility in the standard that you can do something completely illegal<br>

but have it work in 90% of cases, Marmot should pick up on these.<br>

<a href="http://www.hlrs.de/organization/amt/projects/marmot/" target="_blank">http://www.hlrs.de/organization/amt/projects/marmot/</a><br>

<br>

We could take this off-line if you prefer, this could potentially get<br>

quite involved...<br>

<font color="#888888"><br>

Ashley Pittman.<br>

</font><div><div></div><div class="Wj3C7c"><br>

_______________________________________________<br>

Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org">Beowulf@beowulf.org</a><br>

To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>

</div></div></blockquote></div><br>
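<br>
For reference, a rough sketch of how the "disable shared memory only" run suggested at the top of this message might be launched. It assumes the shared memory component is named sm, as in the Open MPI 1.2 series, and uses the --mca btl option with the ^ exclusion prefix. The helper name, the process count and ./app are placeholders, not anything from the original job:<br>

```python
def mpirun_without_sm(n_procs, program, *args):
    """Build an mpirun argument list that excludes only the sm BTL.

    Assumes Open MPI 1.2-style component names; program and args are
    placeholders for the real application and its arguments.
    """
    # "^sm" asks Open MPI to use every available BTL except shared memory,
    # so tcp, openib and self stay enabled.
    return ["mpirun", "--mca", "btl", "^sm", "-np", str(n_procs), program, *args]


if __name__ == "__main__":
    # Print the command line instead of running it, so it can be checked first.
    print(" ".join(mpirun_without_sm(4, "./app")))
```

If the hang goes away with shared memory excluded but comes back with it enabled, that would point at the buffer overflow described above.<br>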