[Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

Gus Correa gus at ldeo.columbia.edu
Tue Apr 30 13:20:03 PDT 2019


It may be using IPoIB (TCP/IP over IB), not verbs/RDMA.
You can force it to use the openib BTL (verbs/RDMA) with the command
below (vader handles in-node shared memory):

mpirun --mca btl openib,self,vader ...
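An equivalent approach, if you'd rather exclude than enumerate, is to
rule out the TCP BTL (the leading caret negates the selection):

mpirun --mca btl ^tcp ...

Either way, if the openib BTL cannot actually be used, the run should
fail with an error instead of silently falling back to TCP/IP.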

This flag may also help show which BTL (byte transfer layer) is being used:

 --mca btl_base_verbose 30
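
For example, combining the two (this just puts the flags above together
with the OSU latency run from your message, so adjust -np, the hostfile,
and the binary path to match your setup):

mpirun --mca btl openib,self,vader --mca btl_base_verbose 30 \
    -np 2 -hostfile ./hostfile ./osu_latency

The verbose output should state which BTL each pair of processes selected.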

See these FAQ entries:
https://www.open-mpi.org/faq/?category=openfabrics#ib-btl
https://www.open-mpi.org/faq/?category=all#tcp-routability-1.3

For more detail, it's really better to ask on the Open MPI list. They are the pros!

My two cents,
Gus Correa





On Tue, Apr 30, 2019 at 3:57 PM Faraz Hussain <info at feacluster.com> wrote:

> Thanks, after building openmpi 4 from source, it now works! However it
> still gives this message below when I run openmpi with the verbose setting:
>
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
>    Local host:           lustwzb34
>    Local device:         mlx4_0
>    Local port:           1
>    CPCs attempted:       rdmacm, udcm
>
> However, the results from my latency and bandwidth tests seem to be
> what I would expect from InfiniBand. See:
>
> [hussaif1 at lustwzb34 pt2pt]$ mpirun -v -np 2 -hostfile ./hostfile ./osu_latency
> # OSU MPI Latency Test v5.3.2
> # Size          Latency (us)
> 0                       1.87
> 1                       1.88
> 2                       1.93
> 4                       1.92
> 8                       1.93
> 16                      1.95
> 32                      1.93
> 64                      2.08
> 128                     2.61
> 256                     2.72
> 512                     2.93
> 1024                    3.33
> 2048                    3.81
> 4096                    4.71
> 8192                    6.68
> 16384                   8.38
> 32768                  12.13
> 65536                  19.74
> 131072                 35.08
> 262144                 64.67
> 524288                122.11
> 1048576               236.69
> 2097152               465.97
> 4194304               926.31
>
> [hussaif1 at lustwzb34 pt2pt]$  mpirun -v -np 2 -hostfile ./hostfile ./osu_bw
> # OSU MPI Bandwidth Test v5.3.2
> # Size      Bandwidth (MB/s)
> 1                       3.09
> 2                       6.35
> 4                      12.77
> 8                      26.01
> 16                     51.31
> 32                    103.08
> 64                    197.89
> 128                   362.00
> 256                   676.28
> 512                  1096.26
> 1024                 1819.25
> 2048                 2551.41
> 4096                 3886.63
> 8192                 3983.17
> 16384                4362.30
> 32768                4457.09
> 65536                4502.41
> 131072               4512.64
> 262144               4531.48
> 524288               4537.42
> 1048576              4510.69
> 2097152              4546.64
> 4194304              4565.12
>
> When I run ibv_devinfo I get:
>
> [hussaif1 at lustwzb34 pt2pt]$ ibv_devinfo
> hca_id: mlx4_0
>          transport:                      InfiniBand (0)
>          fw_ver:                         2.36.5000
>          node_guid:                      480f:cfff:fff5:c6c0
>          sys_image_guid:                 480f:cfff:fff5:c6c3
>          vendor_id:                      0x02c9
>          vendor_part_id:                 4103
>          hw_ver:                         0x0
>          board_id:                       HP_1360110017
>          phys_port_cnt:                  2
>          Device ports:
>                  port:   1
>                          state:                  PORT_ACTIVE (4)
>                          max_mtu:                4096 (5)
>                          active_mtu:             1024 (3)
>                          sm_lid:                 0
>                          port_lid:               0
>                          port_lmc:               0x00
>                          link_layer:             Ethernet
>
>                  port:   2
>                          state:                  PORT_DOWN (1)
>                          max_mtu:                4096 (5)
>                          active_mtu:             1024 (3)
>                          sm_lid:                 0
>                          port_lid:               0
>                          port_lmc:               0x00
>                          link_layer:             Ethernet
>
> I will ask the openmpi mailing list whether my results make sense!
>
>
> Quoting Gus Correa <gus at ldeo.columbia.edu>:
>
> > Hi Faraz
> >
> > By all means, download the Open MPI tarball and build from source.
> > Otherwise there won't be support for IB (the CentOS Open MPI packages
> > most likely rely only on TCP/IP).
> >
> > Read their README file (it comes in the tarball), and take a careful look
> > at their (excellent) FAQ:
> > https://www.open-mpi.org/faq/
> > Many issues can be solved by just reading these two resources.
> >
> > If you hit more trouble, subscribe to the Open MPI mailing list and ask
> > questions there, because you will get advice directly from the Open MPI
> > developers, and the fix will come easily.
> > https://www.open-mpi.org/community/lists/ompi.php
> >
> > My two cents,
> > Gus Correa
> >
> > On Tue, Apr 30, 2019 at 3:07 PM Faraz Hussain <info at feacluster.com> wrote:
> >
> >> Thanks, yes I have installed those libraries. See below. Initially I
> >> installed the libraries via yum, but then I tried installing the RPMs
> >> directly from the Mellanox website
> >> (MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tar). Even after doing
> >> that, I still got the same error with openmpi. I will try your
> >> suggestion of building openmpi from source next!
> >>
> >> root at lustwzb34:/root # yum list | grep ibverbs
> >> libibverbs.x86_64                     41mlnx1-OFED.4.5.0.1.0.45101
> >> libibverbs-devel.x86_64               41mlnx1-OFED.4.5.0.1.0.45101
> >> libibverbs-devel-static.x86_64        41mlnx1-OFED.4.5.0.1.0.45101
> >> libibverbs-utils.x86_64               41mlnx1-OFED.4.5.0.1.0.45101
> >> libibverbs.i686                       17.2-3.el7         rhel-7-server-rpms
> >> libibverbs-devel.i686                 1.2.1-1.el7        rhel-7-server-rpms
> >>
> >> root at lustwzb34:/root # lsmod | grep ib
> >> ib_ucm                 22602  0
> >> ib_ipoib              168425  0
> >> ib_cm                  53141  3 rdma_cm,ib_ucm,ib_ipoib
> >> ib_umad                22093  0
> >> mlx5_ib               339961  0
> >> ib_uverbs             121821  3 mlx5_ib,ib_ucm,rdma_ucm
> >> mlx5_core             919178  2 mlx5_ib,mlx5_fpga_tools
> >> mlx4_ib               211747  0
> >> ib_core               294554  10
> >> rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
> >> mlx4_core             360598  2 mlx4_en,mlx4_ib
> >> mlx_compat             29012  15
> >> rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
> >> devlink                42368  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
> >> libcrc32c              12644  3 xfs,nf_nat,nf_conntrack
> >> root at lustwzb34:/root #
> >>
> >>
> >>
> >> > Did you install libibverbs  (and libibverbs-utils, for information and
> >> > troubleshooting)?
> >>
> >> > yum list |grep ibverbs
> >>
> >> > Are you loading the ib modules?
> >>
> >> > lsmod |grep ib
> >>
> >>
>
>
>
>