[Beowulf] slow mpi init/finalize
cap at nsc.liu.se
Tue Oct 17 09:01:04 PDT 2017
On Tue, 17 Oct 2017 10:59:41 -0400
Michael Di Domenico <mdidomenico4 at gmail.com> wrote:
> > I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u mpiexec.hydra ...
> > You may want to set I_MPI_DEBUG=4 or so to see what it does.
> i can confirm that the dapl test with intelmpi is pretty speedy.
It may be interesting to see what it picks by default (compare output
> when i startup an mpi job without dapl enabled it takes ~60 seconds
I think you mean using the default dapl provider (as opposed to the
specific ucm provider I suggested). As far as I know, IntelMPI defaults
to dapl on Mellanox regardless of version (unless possibly your IntelMPI
is very new and you have a libfabric version installed...).
> before the test actually starts, with dapl enabled it's only a few
That is still very slow. For reference, I timed a 1024-rank startup on
one of our systems with IntelMPI and dapl on ucm, and it took a bit
under 0.5s depending on how you time it (some amount of lazy init is
happening). If I force IntelMPI on that system to use verbs
(I_MPI_FABRICS=ofa), the same startup takes 5 seconds (~10x slower).
I have not tested a dapl provider using rdmacm, as that would require me
to change our system dat.conf, I think.
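
A rough way to reproduce this kind of comparison is to wrap the launch
in a small timing helper. This is only a sketch: time_launch and ./hello
are hypothetical names (./hello stands for any trivial program that
calls MPI_Init and MPI_Finalize), GNU date is assumed for sub-second
resolution, and the number includes process launch as well as MPI init.

```shell
#!/bin/sh
# Hypothetical helper: print the wall-clock seconds a command takes.
# Assumes GNU date (for %N); output of the command itself is discarded.
time_launch() {
    start=$(date +%s.%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s.%N)
    echo "$end $start" | awk '{ printf "%.3f\n", $1 - $2 }'
}

# Example use (commented out so the file can be sourced without MPI):
# dapl with the ucm provider named earlier in this thread
#   time_launch env I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u \
#       mpiexec.hydra -n 1024 ./hello
# verbs
#   time_launch env I_MPI_FABRICS=ofa mpiexec.hydra -n 1024 ./hello
```

Passing the settings through env keeps them scoped to the one launch.
Adding I_MPI_DEBUG=4 to the env list and dropping the output redirect
should show which fabric and provider were actually picked.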
Either way, with 60s time scales and ibacm so broken that it fails
instantly,
I suspect you have some hostname/dns/tcp-ip-on-eth or other fundamental
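
One quick sanity check for the hostname/dns suspicion is to time a name
lookup on each node. The sketch below is an assumption about what to
look at, not a diagnosis: check_resolve is a hypothetical helper, and it
relies on getent being available. A lookup that fails, or that takes
several seconds, would point at resolver or /etc/hosts trouble.

```shell
#!/bin/sh
# Hypothetical helper: resolve a host name and report status and
# elapsed whole seconds (resolver timeouts are on the seconds scale).
check_resolve() {
    host=$1
    start=$(date +%s)
    if getent hosts "$host" >/dev/null 2>&1; then
        status=ok
    else
        status=fail
    fi
    end=$(date +%s)
    echo "$host $status $((end - start))s"
}

# Check this node's own name as reported by the kernel.
check_resolve "$(uname -n)"
```

Running the same check against the names of the other nodes in the job
would show whether the delay is specific to some hosts.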