[Beowulf] Intel MPI 2.0 mpdboot and large clusters,
slow to start up, sometimes not at all
deadline at clustermonkey.net
Thu Oct 5 13:12:22 PDT 2006
One of the issues I had with using mpd with Sun Grid Engine was that
if it starts the mpd daemons on a per-job basis, you can run into
problems when the scheduler starts another mpd on the same node
for a different MPI job.
The solution I used (as described by the SGE integration docs)
was to use the spmd with a unique port number. This is similar to
using LAM_MPI_SOCKET_SUFFIX for lam runs.
I am not sure how well spmd startup and shutdown scale; I have
only used it with small numbers of nodes (< 100).
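As a rough illustration of the unique-port idea, here is a minimal sketch that derives a per-job port from the scheduler's job id. The function name job_port and the port range are assumptions made up for this example, not something taken from the SGE integration docs; SGE does export the job id to the job environment as JOB_ID.

```python
import os

# Hypothetical port range set aside for per-job daemons (an assumption,
# pick values that suit your site and firewall rules).
PORT_BASE = 20000
PORT_SPAN = 10000


def job_port(job_id: int, base: int = PORT_BASE, span: int = PORT_SPAN) -> int:
    """Map a scheduler job id to a port in [base, base + span)."""
    if job_id < 0:
        raise ValueError("job id must be non-negative")
    return base + (job_id % span)


def port_for_current_job() -> int:
    """Read JOB_ID from the environment (set by SGE inside a job)."""
    return job_port(int(os.environ["JOB_ID"]))
```

Two jobs only collide if their ids differ by a multiple of the span, so the span should comfortably exceed the number of jobs that can run at the same time on one node.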
> If you have a batch system that can start the MPDs, you should
> consider starting the MPI processes directly with the batch system
> and providing a separate service to provide the startup information.
> In MPICH2, the MPI implementation is separated from the process
> management. The mpd system is simply an example process manager
> (albeit one with many useful features). We didn't expect users to
> use one existing parallel process management system to start another
> one; instead, we expected that those existing systems would use the
> PMI interface used in MPICH2 to directly start the MPI processes. I
> know that you don't need MPD for MPICH2; I expect the same is true
> for Intel MPI.
> On Oct 4, 2006, at 11:31 AM, Bill Bryce wrote:
>> Hi Matt,
>> You pretty much diagnosed our problem correctly. After discussing
>> with the customer and a few more engineers here, we found that the python
>> was very slow at starting the ring. Seems to be a common problem with
>> MPD startup on other MPI implementations as well (I could be wrong
>> though). We also modified the recvTimeout since onsite engineers
>> suspected that would help as well. The final fix we are working on is
>> starting the MPD with the batch system and not relying on ssh - the
>> customer does not want a root MPD ring and wants one per job so the
>> batch system will do this for us.
>> -----Original Message-----
>> From: M J Harvey [mailto:m.j.harvey at imperial.ac.uk]
>> Sent: Wednesday, October 04, 2006 12:23 PM
>> To: Bill Bryce
>> Cc: beowulf at beowulf.org
>> Subject: Re: [Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow
>> to start up, sometimes not at all
>>> We are going through a similar experience at one of our customer sites.
>>> They are trying to run Intel MPI on more than 1,000 nodes. Are you
>>> experiencing problems starting the MPD ring? We noticed it takes a
>>> really long time especially when the node count is large. It also
>>> doesn't work sometimes.
>> I've had similar problems with slow and unreliable startup of the
>> mpd ring. I noticed that before spawning the individual mpds, it
>> connects to each node and checks the version of the installed python
>> (function getversionpython() in mpdboot.py). On my cluster, at least,
>> this check was very slow (not to say pointless). Removing it
>> dramatically improved startup time - now it's merely slow.
>> Also, for jobs with large process counts, it's worth increasing
>> recvTimeout in mpirun from 20 seconds. This value governs the amount of
>> time mpirun waits for the secondary mpi processes to be spawned by the
>> remote mpds, and the default value is much too aggressive for large
>> jobs started via ssh.
>> Kind Regards,
>> Beowulf mailing list, Beowulf at beowulf.org
>> To change your subscription (digest mode or unsubscribe) visit