[Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow to start up, sometimes not at all

Douglas Eadline deadline at clustermonkey.net
Thu Oct 5 13:12:22 PDT 2006


One of the issues I had with using mpd with Sun Grid Engine was that
if it starts the mpd daemons on a per-job basis, you can run into
problems when the scheduler starts another mpd on the same node
for a different MPI job.

The solution I used (as described in the SGE integration docs)
was to start smpd with a unique port number for each job. This is
similar to using LAM_MPI_SOCKET_SUFFIX for LAM runs.

I am not sure how well smpd startup and shutdown scale; I have
only used it with small numbers of nodes (< 100).
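
For the curious, the relevant piece of the job script looked roughly
like the sketch below; the PE name, the port arithmetic, and the exact
flag spellings are from memory (check the integration howto for your
MPICH2/Intel MPI version), and ./my_app is a stand-in:

  #!/bin/sh
  #$ -pe mpich2_smpd 16
  # Derive a per-job port from the job id so that two jobs landing
  # on the same node never talk to each other's daemons.
  port=$((20000 + $JOB_ID % 10000))

  # The PE start script has already brought up one smpd per host on
  # this same port; the job script just points mpiexec at that ring.
  mpiexec -n $NSLOTS -machinefile $TMPDIR/machines -port $port ./my_app

  # The PE stop script tears the smpds down when the job exits.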

 --
 Doug


> If you have a batch system that can start the MPDs, you should
> consider starting the MPI processes directly with the batch system
> and providing a separate service to provide the startup information.
> In MPICH2, the MPI implementation is separated from the process
> management.  The mpd system is simply an example process manager
> (albeit one with many useful features).  We didn't expect users to
> use one existing parallel process management system to start another
> one; instead, we expected that those existing systems would use the
> PMI interface used in MPICH2 to directly start the MPI processes.  I
> know that you don't need MPD for MPICH2; I expect the same is true
> for Intel MPI.
>
> Bill
>
> On Oct 4, 2006, at 11:31 AM, Bill Bryce wrote:
>
>> Hi Matt,
>>
>> You pretty much diagnosed our problem correctly.  After discussing
>> with the customer and a few more engineers here, we found that the
>> python code was very slow at starting the ring.  This seems to be a
>> common problem with MPD startup on other MPI implementations as well
>> (I could be wrong though).  We also modified the recvTimeout, since
>> onsite engineers suspected that would help as well.  The final fix we
>> are working on is starting the MPD with the batch system and not
>> relying on ssh - the customer does not want a root MPD ring and wants
>> one per job, so the batch system will do this for us.
>>
>> Bill.
>>
>>
>> -----Original Message-----
>> From: M J Harvey [mailto:m.j.harvey at imperial.ac.uk]
>> Sent: Wednesday, October 04, 2006 12:23 PM
>> To: Bill Bryce
>> Cc: beowulf at beowulf.org
>> Subject: Re: [Beowulf] Intel MPI 2.0 mpdboot and large clusters, slow
>> to start up, sometimes not at all
>>
>> Hello,
>>
>>> We are going through a similar experience at one of our customer
>>> sites.  They are trying to run Intel MPI on more than 1,000 nodes.
>>> Are you experiencing problems starting the MPD ring?  We noticed it
>>> takes a really long time, especially when the node count is large.
>>> It also just doesn't work sometimes.
>>
>> I've had similar problems with slow and unreliable startup of the
>> Intel mpd ring. I noticed that before spawning the individual mpds,
>> it connects to each node and checks the version of the installed
>> python (function getversionpython() in mpdboot.py). On my cluster,
>> at least, this check was very slow (not to say pointless). Removing
>> it dramatically improved startup time - now it's merely slow.
>>
>> Also, for jobs with large process counts, it's worth increasing
>> recvTimeout in mpirun from 20 seconds. This value governs the amount
>> of time mpirun waits for the secondary mpi processes to be spawned
>> by the remote mpds, and the default value is much too aggressive for
>> large jobs started via ssh.
>>
>> Kind Regards,
>>
>> Matt
>>
>>
>
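
Coming back to Matt's observation about getversionpython(): that check
amounts to one serial ssh round trip per node before any mpd is ever
launched - something like the line below for every host in the ring -
which is why it dominates startup on big clusters. The exact command
mpdboot.py assembles may differ, and node0042 is just a stand-in:

  # one of these runs per node, one after another, before any mpd starts
  ssh node0042 "python -c 'import sys; print sys.version'"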

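Likewise, since mpirun here is itself a python script, Matt's
recvTimeout change can be a one-line edit. Treat this as a sketch - I
am assuming the assignment literally reads "recvTimeout = 20" in your
copy, and 120 is only a guess to be scaled with node count - so keep
the .bak:

  # raise the 20 second default to something a large ring can meet
  sed -i.bak 's/recvTimeout = 20/recvTimeout = 120/' `which mpirun`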

-- 
Doug


