[Beowulf] How to debug error with Open MPI 3 / Mellanox / Red Hat?

Benson Muite benson_muite at emailplus.org
Wed May 1 01:38:35 PDT 2019


Hi Faraz,

Have you tried any other MPI distributions (e.g., MPICH, MVAPICH)?

Regards,

Benson

On 4/30/19 11:20 PM, Gus Correa wrote:
> It may be using IPoIB (TCP/IP over IB), not verbs/rdma.
> You can force it to use openib (verbs, rdma) with the command below
> (vader is the in-node shared-memory BTL):
>
> mpirun  --mca  btl openib,self,vader ...
>
> This flag may also help tell which BTL (byte transfer layer) is being used:
>
>   --mca btl_base_verbose 30
>
> See these FAQ entries:
> https://www.open-mpi.org/faq/?category=openfabrics#ib-btl
> https://www.open-mpi.org/faq/?category=all#tcp-routability-1.3
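>
> For example, combining both settings with the benchmark from your own test:
>
>   mpirun --mca btl openib,self,vader --mca btl_base_verbose 30 \
>          -np 2 -hostfile ./hostfile ./osu_latency
>
> The verbose output should show which BTL components get picked for the
> inter-node connection (openib vs. tcp).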
> Better to ask for more details on the Open MPI list. They are the pros!
> My two cents, Gus Correa
>
>
> On Tue, Apr 30, 2019 at 3:57 PM Faraz Hussain <info at feacluster.com> wrote:
>
>     Thanks, after building openmpi 4 from source, it now works!
>     However, it still gives the message below when I run openmpi with
>     the verbose setting:
>
>     No OpenFabrics connection schemes reported that they were able to be
>     used on a specific port.  As such, the openib BTL (OpenFabrics
>     support) will be disabled for this port.
>
>        Local host:           lustwzb34
>        Local device:         mlx4_0
>        Local port:           1
>        CPCs attempted:       rdmacm, udcm
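>
>     One quick check (a generic diagnostic, assuming the ompi_info from
>     the same Open MPI build is on the PATH) is to list which BTL
>     components the build contains:
>
>     ompi_info | grep btl
>
>     If openib shows up there, the component was built, and the message
>     above is only about connection setup (the rdmacm/udcm CPCs) on that
>     particular port.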
>
>     However, the results from my latency and bandwidth tests seem to be
>     what I would expect from InfiniBand. See:
>
>     [hussaif1 at lustwzb34 pt2pt]$  mpirun -v -np 2 -hostfile ./hostfile
>     ./osu_latency
>     # OSU MPI Latency Test v5.3.2
>     # Size          Latency (us)
>     0                       1.87
>     1                       1.88
>     2                       1.93
>     4                       1.92
>     8                       1.93
>     16                      1.95
>     32                      1.93
>     64                      2.08
>     128                     2.61
>     256                     2.72
>     512                     2.93
>     1024                    3.33
>     2048                    3.81
>     4096                    4.71
>     8192                    6.68
>     16384                   8.38
>     32768                  12.13
>     65536                  19.74
>     131072                 35.08
>     262144                 64.67
>     524288                122.11
>     1048576               236.69
>     2097152               465.97
>     4194304               926.31
>
>     [hussaif1 at lustwzb34 pt2pt]$  mpirun -v -np 2 -hostfile ./hostfile
>     ./osu_bw
>     # OSU MPI Bandwidth Test v5.3.2
>     # Size      Bandwidth (MB/s)
>     1                       3.09
>     2                       6.35
>     4                      12.77
>     8                      26.01
>     16                     51.31
>     32                    103.08
>     64                    197.89
>     128                   362.00
>     256                   676.28
>     512                  1096.26
>     1024                 1819.25
>     2048                 2551.41
>     4096                 3886.63
>     8192                 3983.17
>     16384                4362.30
>     32768                4457.09
>     65536                4502.41
>     131072               4512.64
>     262144               4531.48
>     524288               4537.42
>     1048576              4510.69
>     2097152              4546.64
>     4194304              4565.12
>
>     When I run ibv_devinfo I get:
>
>     [hussaif1 at lustwzb34 pt2pt]$ ibv_devinfo
>     hca_id: mlx4_0
>              transport:                      InfiniBand (0)
>              fw_ver:                         2.36.5000
>              node_guid:                      480f:cfff:fff5:c6c0
>              sys_image_guid:                 480f:cfff:fff5:c6c3
>              vendor_id:                      0x02c9
>              vendor_part_id:                 4103
>              hw_ver:                         0x0
>              board_id:                       HP_1360110017
>              phys_port_cnt:                  2
>              Device ports:
>                      port:   1
>                              state:                  PORT_ACTIVE (4)
>                              max_mtu:                4096 (5)
>                              active_mtu:             1024 (3)
>                              sm_lid:                 0
>                              port_lid:               0
>                              port_lmc:               0x00
>                              link_layer:             Ethernet
>
>                      port:   2
>                              state:                  PORT_DOWN (1)
>                              max_mtu:                4096 (5)
>                              active_mtu:             1024 (3)
>                              sm_lid:                 0
>                              port_lid:               0
>                              port_lmc:               0x00
>                              link_layer:             Ethernet
>
>     I will ask the openmpi mailing list whether my results make sense.
>
>
>     Quoting Gus Correa <gus at ldeo.columbia.edu>:
>
>     > Hi Faraz
>     >
>     > By all means, download the Open MPI tarball and build from source.
>     > Otherwise there won't be support for IB (the CentOS Open MPI
>     > packages most likely rely only on TCP/IP).
>     >
>     > Read their README file (it comes in the tarball), and take a
>     > careful look at their (excellent) FAQ:
>     > https://www.open-mpi.org/faq/
>     > Many issues can be solved by just reading these two resources.
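>     >
>     > A typical from-source build looks roughly like this (the version
>     > number, install prefix, and the --with-verbs flag are only an
>     > illustration; check the README for what your release needs):
>     >
>     >   tar xjf openmpi-4.0.1.tar.bz2
>     >   cd openmpi-4.0.1
>     >   ./configure --prefix=/opt/openmpi-4.0.1 --with-verbs
>     >   make -j 8 all install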
>     >
>     > If you hit more trouble, subscribe to the Open MPI mailing list,
>     > and ask questions there,
>     > because you will get advice directly from the Open MPI developers,
>     > and the fix will come easily.
>     > https://www.open-mpi.org/community/lists/ompi.php
>     >
>     > My two cents,
>     > Gus Correa
>     >
>     > On Tue, Apr 30, 2019 at 3:07 PM Faraz Hussain <info at feacluster.com> wrote:
>     >
>     >> Thanks, yes I have installed those libraries. See below.
>     >> Initially I installed the libraries via yum. But then I tried
>     >> installing the rpms directly from the Mellanox website
>     >> (MLNX_OFED_LINUX-4.5-1.0.1.0-rhel7.5-x86_64.tar). Even after
>     >> doing that, I still got the same error with openmpi. I will try
>     >> your suggestion of building openmpi from source next!
>     >>
>     >> root at lustwzb34:/root # yum list | grep ibverbs
>     >> libibverbs.x86_64  41mlnx1-OFED.4.5.0.1.0.45101
>     >> libibverbs-devel.x86_64  41mlnx1-OFED.4.5.0.1.0.45101
>     >> libibverbs-devel-static.x86_64 41mlnx1-OFED.4.5.0.1.0.45101
>     >> libibverbs-utils.x86_64  41mlnx1-OFED.4.5.0.1.0.45101
>     >> libibverbs.i686                       17.2-3.el7       rhel-7-server-rpms
>     >> libibverbs-devel.i686                 1.2.1-1.el7      rhel-7-server-rpms
>     >>
>     >> root at lustwzb34:/root # lsmod | grep ib
>     >> ib_ucm                 22602  0
>     >> ib_ipoib              168425  0
>     >> ib_cm                  53141  3 rdma_cm,ib_ucm,ib_ipoib
>     >> ib_umad                22093  0
>     >> mlx5_ib               339961  0
>     >> ib_uverbs             121821  3 mlx5_ib,ib_ucm,rdma_ucm
>     >> mlx5_core             919178  2 mlx5_ib,mlx5_fpga_tools
>     >> mlx4_ib               211747  0
>     >> ib_core               294554  10 rdma_cm,ib_cm,iw_cm,mlx4_ib,mlx5_ib,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
>     >> mlx4_core             360598  2 mlx4_en,mlx4_ib
>     >> mlx_compat             29012  15 rdma_cm,ib_cm,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,mlx5_fpga_tools,ib_ucm,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
>     >> devlink                42368  4 mlx4_en,mlx4_ib,mlx4_core,mlx5_core
>     >> libcrc32c              12644  3 xfs,nf_nat,nf_conntrack
>     >> root at lustwzb34:/root #
>     >>
>     >>
>     >>
>     >> > Did you install libibverbs (and libibverbs-utils, for
>     >> > information and troubleshooting)?
>     >>
>     >> > yum list |grep ibverbs
>     >>
>     >> > Are you loading the ib modules?
>     >>
>     >> > lsmod |grep ib
>     >>
>     >>
>
>
>
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf