[Beowulf] Puzzling Intel mpi behavior with slurm

Peter Kjellström cap at nsc.liu.se
Wed Apr 11 07:38:31 PDT 2018


On Thu, 05 Apr 2018 09:10:57 -0600
Faraz Hussain <info at feacluster.com> wrote:

> Here's something quite baffling. I have a cluster running slurm but  
> have not setup passwordless ssh for a user yet. So when the user
> runs "mpirun -n 2 -hostfile hosts hostname", it will hang because of
> ssh issue. That is expected.
> 
> Now the baffling thing is the mpirun command works inside a slurm  
> script! How can it work if passwordless ssh has not been configured?  
> Does slurm use some different authentication (munge?) to login to
> the hosts and execute the hostname command?

What happens is that mpirun sees the slurm environment variables set
inside the job and switches to a slurm-aware mode.
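
As a rough illustration of that detection step, here is a minimal Python
sketch. It is not Intel's code. SLURM_JOB_ID and SLURM_JOB_NODELIST are
variables slurm sets inside a job, but which ones mpirun actually checks
is an assumption on my part, and pick_launcher is a made-up name.

```python
import os

# Variables slurm sets inside a job allocation. Which of these mpirun
# really inspects is an assumption made for illustration only.
SLURM_HINTS = ("SLURM_JOB_ID", "SLURM_JOB_NODELIST")


def pick_launcher(env=None):
    """Return "srun" if env looks like a slurm job, otherwise "ssh"."""
    env = os.environ if env is None else env
    # Any non-empty hint variable counts as "we are inside a job".
    if any(env.get(name) for name in SLURM_HINTS):
        return "srun"
    return "ssh"


if __name__ == "__main__":
    # Prints the launcher this sketch would choose in the current shell.
    print(pick_launcher())
```

Run from a login shell this would pick ssh, and run from inside a job
script it would pick srun, which matches the behavior you are seeing.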

In this mode it uses srun to launch pmi_proxy processes on each node
of the job. Then it proceeds to start all ranks using these pmi_proxy
processes.

The process tree ends up being something like this on the first node:

slurmd->slurmstepd->bash(jobscript)->mpirun->srun -w nodes[..] pmi_proxy

And on the other nodes:

slurmd->slurmstepd->pmi_proxy->rank[0...n]

Authentication/authorization is handled by slurm and depends on how you
set it up (often munge).

Cheers,
 Peter K

