[Beowulf] Problems with a JS21 - Ah, the networking...
patrick at myri.com
Mon Oct 1 06:35:38 PDT 2007
Ivan Paganini wrote:
> The myrinet connection was working right, but sometimes a user program
> just got stuck - one of the processes was sleeping, and all others
> were running. Then, the program hangs. Investigating this further,
Unless you are using blocking receives ("--mx-recv blocking" or
"--mx-recv hybrid"), the default mode is polling. So, a process will
only sleep if it is still in the spawning phase (in MPI_Init) or if it's
blocking on something outside MPI (like disk IO).
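
To see which case applies on a node, one option is to read the kernel's
state code for the stuck process. The following is only a sketch, assuming
a Linux node with /proc mounted; the helper names are made up for
illustration and are not part of MX or MPICH-MX. A polling process should
show "R", an ordinary sleep shows "S", and a process blocked on disk IO
usually shows "D".

```python
# Sketch: read the scheduler state of a process from /proc/<pid>/stat.
# Assumes Linux; parse_state and process_state are illustrative names.

STATE_NAMES = {
    "R": "running (e.g. polling)",
    "S": "sleeping (interruptible)",
    "D": "disk sleep (uninterruptible, often IO)",
    "T": "stopped",
    "Z": "zombie",
}

def parse_state(stat_line):
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' before reading the state.
    after_comm = stat_line[stat_line.rfind(")") + 1:]
    return after_comm.split()[0]

def process_state(pid):
    # Returns the one-letter state code for a live process.
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())
```

Calling process_state(pid) for each MPI rank on a node and looking the
result up in STATE_NAMES shows which ranks are sleeping rather than polling.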
> overheat. mpirun.ch_mx -v shows that all the processes are issued ok
> to the nodes, but somehow one (or more) process go to sleep or never
> starts, and all the other processes just hangs. The mx diagnose tools
All processes wait on everybody at spawn time, so if one process never
starts, the rest of the MPI world will wait for it, possibly forever.
The root problem is the process not starting.
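
The effect can be reproduced in miniature with a thread barrier. This is a
toy model only, not how MPICH-MX implements its spawn: the function name,
the use of threads, and the timeout are all assumptions made so that the
sketch terminates. A real job has no such timeout, which is why it can wait
forever.

```python
# Toy model: every started process waits at a barrier sized for the
# whole job. If one process never starts, nobody gets past the barrier.
import threading

def simulate_spawn(expected, started, timeout_s=0.2):
    barrier = threading.Barrier(expected)
    results = [None] * started

    def worker(rank):
        try:
            # The timeout exists only so this demo ends.
            barrier.wait(timeout=timeout_s)
            results[rank] = "running"
        except threading.BrokenBarrierError:
            results[rank] = "stuck"

    threads = [threading.Thread(target=worker, args=(r,))
               for r in range(started)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With simulate_spawn(4, 4) every rank gets through; with
simulate_spawn(4, 3) all three started ranks end up "stuck", even though
nothing is wrong with them individually.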
The spawning phase in MPICH-MX uses sockets and ssh (or rsh). Usually,
ssh uses native Ethernet, but it could also use IPoM (Ethernet over
Myrinet). Which case is it for you?
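
One rough way to answer that is to check which subnet the node hostnames
resolve to. The sketch below assumes you know the address ranges assigned
to each interface; the network labels and CIDR ranges in the usage note are
placeholders, not values from this cluster, and the function names are
illustrative.

```python
# Sketch: classify the address a hostname resolves to by subnet.
import ipaddress
import socket

def classify_address(addr, networks):
    # networks maps a label (e.g. "ethernet") to a CIDR string.
    ip = ipaddress.ip_address(addr)
    for name, cidr in networks.items():
        if ip in ipaddress.ip_network(cidr):
            return name
    return "unknown"

def classify_host(hostname, networks):
    # Uses the system resolver; whether ssh picks the same address
    # depends on the local resolver and ssh configuration.
    return classify_address(socket.gethostbyname(hostname), networks)
```

For example, with networks = {"ethernet": "192.168.1.0/24",
"ipom": "10.0.0.0/24"} (placeholder ranges), classify_host(node, networks)
for each name in the machine file tells you which network the spawn
connections are likely to use.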