[Beowulf] How to know if infiniband network works?

Gus Correa gus at ldeo.columbia.edu
Wed Aug 2 11:25:09 PDT 2017


Hi Faraz

The output of lsmod looks good to me.
It shows that you have verbs, rdma, etc.
Presumably the same is true on all nodes (the output you sent
seems to come from a single node, lustwzb4).

ompi_info shows that Open MPI was built with openib (InfiniBand)
support. So, another good thing.
Therefore, by default Open MPI will try to use InfiniBand,
unless the IB card in one of the nodes has a problem,
or the IB kernel modules were not loaded, etc.
But you shouldn't worry about it until it happens.


I think ibhosts is just telling you that the IB cards
have two ports each ("ports 2", with a space in between),
not that port 2 is the one in use.
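
To see the per-node port counts at a glance, here is a minimal sketch
(my own helper, not part of the IB diagnostics) that reduces the ibhosts
output to "node name, number of ports". It assumes the line layout shown
in your output; the function name ibhosts_ports is made up.

```shell
# Hypothetical helper: read "ibhosts" output on stdin and print
# "<node> <number of ports on its HCA>" for each channel adapter.
# Assumes lines of the form:  Ca : <guid> ports <N> "<node> HCA-1"
ibhosts_ports() {
    awk '$1 == "Ca" && $4 == "ports" {
        node = $6
        gsub(/"/, "", node)   # drop the opening quote of the description
        print node, $5
    }'
}

# Usage (on a node with the IB diagnostics installed):
#   ibhosts | ibhosts_ports
```

The second column is a port count per card, so it says nothing about
which port is cabled; ibstat and ibnetdiscover tell you that.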

Also, check the back of the nodes for the IB cable connections.
They're thick cables, should be connected to the IB switch.
You will *probably* find two IB ports in the nodes, with only
one connected. At least that is what your ifconfig output suggests.

ibstat runs only on the node you're in.
If you have a tool such as pdsh (parallel shell),
you can use it to run ibstat on all nodes.
Or just ssh to each node and run ibstat.
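
A minimal sketch of the ssh variant, assuming passwordless ssh and the
node names lustwzb1 ... lustwzb16 that appear in your ibhosts output.
The helper ibstat_summary is something I made up to boil the ibstat
output down to one line per port; it assumes the layout you pasted.

```shell
# Hypothetical helper: read "ibstat" output on stdin and print one line
# per port: "<CA> port <N> <State> rate <Rate>".
ibstat_summary() {
    awk '
        # "CA <name>" header line (but not "CA type: ...")
        $1 == "CA" && NF == 2 { ca = $2; gsub(/[^A-Za-z0-9_]/, "", ca) }
        # "Port <N>:" header line (but not "Port GUID: ...")
        $1 == "Port" && $2 ~ /^[0-9]+:$/ { port = $2; sub(/:/, "", port) }
        # logical state ("Physical state:" starts with another word)
        $1 == "State:" { state = $2 }
        # Rate is the last field needed for a port, so print here
        $1 == "Rate:" { print ca, "port", port, state, "rate", $2 }
    '
}

# Usage, looping over the compute nodes (node names are an assumption):
#   for n in $(seq 1 16); do
#       ssh "lustwzb$n" ibstat | ibstat_summary | sed "s/^/lustwzb$n: /"
#   done
```

With pdsh, something like pdsh -w 'lustwzb[1-16]' ibstat should give the
raw output of all nodes in one go, each line prefixed by the node name.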

Anyway, I don't see any red flag or problem.
[Well, unless somebody else spots something that I haven't seen,
which is *very* possible.]
It seems to be good to go for running MPI (Open MPI) programs
over InfiniBand.


********

Now some items a bit off topic, not a specific answer to your 
question, but hopefully helpful.

1) pdsh

Do you have a head/master node in the cluster?
Is it lustwzb99 perhaps?
You could run pdsh from there.
It is very helpful for cluster-wide checks, etc.
(You can install it if it is not there. Sometimes an older tool,
"dsh", is already installed.)

https://sourceforge.net/projects/pdsh/

[It may also be available as a package (rpm or similar)
for your Linux distribution.]

2) Open MPI details and customization

I'd suggest that you take a look at the Open MPI FAQ
for more details, especially on how to control things at runtime.
There are zillions of "MCA parameters" that allow a lot of 
customization, if you care:

https://www.open-mpi.org/faq/

Their README file (you can get it in their tarball) is also
a good source of information.
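
As one example of an MCA parameter, "btl" selects the transports at
runtime. Here is a minimal sketch, assuming the component names of the
1.8 series that your ompi_info reports (openib, tcp). The wrapper
function names and the MPIRUN variable are made up for illustration.

```shell
# MPIRUN can be overridden, e.g. MPIRUN=echo for a dry run.
MPIRUN=${MPIRUN:-mpirun}

# Exclude the TCP transport, so a job that cannot use InfiniBand
# between nodes fails loudly instead of quietly falling back to TCP.
run_without_tcp() {
    "$MPIRUN" --mca btl '^tcp' "$@"
}

# Exclude the openib transport, to get a TCP baseline for comparison.
run_without_openib() {
    "$MPIRUN" --mca btl '^openib' "$@"
}

# Usage (./my_mpi_prog is a placeholder for your own program):
#   run_without_tcp    -np 32 ./my_mpi_prog
#   run_without_openib -np 32 ./my_mpi_prog
```

If the two runs take about the same time on several nodes, either the
program is not communication bound or InfiniBand is not really in use.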

3) Resource managers and integration with Open MPI

Also, if you have a "resource manager" (a.k.a. job queue system),
such as Torque/PBS, Slurm, SGE, you may want to look into integrating
it with Open MPI (if that is not already done), and how to
set up the job scripts to take advantage of that integration.
The Open MPI FAQs have some material on this (and the Open MPI README 
file also), but you may need to consult the "resource manager" 
documentation as well. [If you're using Torque start with "man qsub".]
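
One way to see whether your Open MPI build already knows about a
resource manager is to look for the "ras" and "plm" components in the
ompi_info output (for instance "tm" for Torque/PBS, "slurm" for Slurm,
"gridengine" for SGE). A minimal sketch, with a made-up function name,
assuming the "MCA <framework>: <component> (...)" line layout that your
ompi_info output shows:

```shell
# Hypothetical helper: read "ompi_info" output on stdin and list the
# resource allocation (ras) and process launch (plm) components.
rm_components() {
    awk '$1 == "MCA" && ($2 == "ras:" || $2 == "plm:") {
        framework = $2
        sub(/:/, "", framework)
        print framework, $3
    }'
}

# Usage:
#   ompi_info | rm_components
```

If the component for your resource manager is missing, Open MPI was
probably built without that support, and the job scripts will need a
hostfile or similar.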


4) Open MPI installation: NFS vs. local

You may need to check if Open MPI is installed, say,
in an NFS shared directory, visible to all nodes,
or perhaps installed via package (RPM or similar) on
all nodes.
In the latter case, make sure you have the same exact
version (including the compiler that was used to build it) everywhere.
Installing on NFS makes life easier on small clusters (for updates, etc).
Make sure the NFS directory is exported/mounted to/by all nodes.
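
A minimal sketch of a version check across nodes, assuming input lines
of the form "node: version string", which is how pdsh prefixes its
output. The function name versions_agree is made up.

```shell
# Hypothetical helper: read "node: version" lines on stdin, print each
# distinct version with the nodes reporting it, and return non-zero
# if more than one distinct version was seen.
versions_agree() {
    awk -F': ' '
        { seen[$2] = seen[$2] " " $1 }
        END {
            n = 0
            for (v in seen) { n++; print v ":" seen[v] }
            exit (n > 1)
        }'
}

# Usage (node names are an assumption taken from the ibhosts output):
#   pdsh -w 'lustwzb[1-16]' 'mpirun --version 2>&1 | head -1' \
#       | versions_agree || echo "Open MPI versions differ"
```

This only compares the version string, not the compiler used for the
build; ompi_info reports the compilers if you need to compare those too.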

5) Environment variables and "environment modules" package

You may also need to set some environment variables (such as PATH and 
LD_LIBRARY_PATH) to ensure that Open MPI (and any other software) works.
The simplest way is brute force in the .bashrc/.tcshrc initialization 
files.
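
A minimal sketch of the brute force approach for .bashrc. The install
prefix /opt/openmpi-1.8.4 is only a placeholder; use whatever directory
Open MPI actually lives in on your cluster.

```shell
# Hypothetical install prefix; adjust to the real location.
OMPI_HOME=/opt/openmpi-1.8.4

# Put the Open MPI commands (mpirun, mpicc, ompi_info) first in PATH.
export PATH="$OMPI_HOME/bin:$PATH"

# Prepend the Open MPI libraries; the ${VAR:+...} form avoids a
# trailing ":" when LD_LIBRARY_PATH was empty or unset.
export LD_LIBRARY_PATH="$OMPI_HOME/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

For .tcshrc the same thing is done with setenv instead of export.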

However, I'd recommend taking a look at the "environment modules"
package, which provides a much cleaner solution and makes it easy
for users to switch from one compiler to another, from one version
of Open MPI to another, etc.
If you provide a variety of versions of software, that is a must.

http://modules.sourceforge.net/


[Available as a package in many Linux distros.]

**

I hope this helps,
Gus Correa

On 08/02/2017 01:37 PM, Faraz Hussain wrote:
> Thanks for the tips. We have openmpi installed. Here is some relevant 
> output from the commands you suggested. One confusing thing is ibstat 
> shows only port 1 as active. But ibhosts shows port 2 only.
> 
> [hussaif1 at lustwzb4 test]$ lsmod | grep ib
> ib_ucm                 12120  0
> ib_ipoib              114971  0
> ib_cm                  42214  3 ib_ucm,rdma_cm,ib_ipoib
> ib_uverbs              50244  2 rdma_ucm,ib_ucm
> ib_umad                12562  0
> mlx5_ib               103326  0
> mlx5_core              85201  1 mlx5_ib
> mlx4_ib               164865  0
> ib_sa                  24170  5 rdma_ucm,rdma_cm,ib_ipoib,ib_cm,mlx4_ib
> ib_mad                 43241  4 ib_cm,ib_umad,mlx4_ib,ib_sa
> ib_core                95458  12 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx4_ib,ib_sa,ib_mad
> ib_addr                 7732  3 rdma_cm,ib_uverbs,ib_core
> ipv6                  317829  145 ib_ipoib,mlx4_ib,ib_addr
> mlx4_core             258183  2 mlx4_en,mlx4_ib
> compat                 23876  17 rdma_ucm,ib_ucm,rdma_cm,iw_cm,ib_ipoib,ib_cm,ib_uverbs,ib_umad,mlx5_ib,mlx5_core,mlx4_en,mlx4_ib,ib_sa,ib_mad,ib_core,ib_addr,mlx4_core
> libcrc32c               1246  1 bnx2x
> 
> [hussaif1 at lustwzb4 test]$ ompi_info | grep ib
> 
> MCA btl: openib (MCA v2.0, API v2.0, Component v1.8.4)
> 
> [hussaif1 at lustwzb4 test]$ ibstat
> CA 'mlx4_0'
>          CA type: MT4099
>          Number of ports: 2
>          Firmware version: 2.11.550
>          Hardware version: 0
>          Node GUID: 0xf452140300163b70
>          System image GUID: 0xf452140300163b73
>          Port 1:
>                  State: Active
>                  Physical state: LinkUp
>                  Rate: 40 (FDR10)
>                  Base lid: 3
>                  LMC: 0
>                  SM lid: 1
>                  Capability mask: 0x02514868
>                  Port GUID: 0xf452140300163b71
>                  Link layer: InfiniBand
>          Port 2:
>                  State: Down
>                  Physical state: Disabled
>                  Rate: 10
>                  Base lid: 0
>                  LMC: 0
>                  SM lid: 0
>                  Capability mask: 0x02514868
>                  Port GUID: 0xf452140300163b72
>                  Link layer: InfiniBand
> 
> [hussaif1 at lustwzb4 test]$ ibhosts
> Ca      : 0xf45214030015bf60 ports 2 "lustwzb9 HCA-1"
> Ca      : 0xf45214030015c0e0 ports 2 "lustwzb16 HCA-1"
> Ca      : 0xf452140300163e20 ports 2 "lustwzb15 HCA-1"
> Ca      : 0xf45214030015c080 ports 2 "lustwzb14 HCA-1"
> Ca      : 0xf45214030015c290 ports 2 "lustwzb13 HCA-1"
> Ca      : 0xf45214030015bf70 ports 2 "lustwzb12 HCA-1"
> Ca      : 0xf452140300163bb0 ports 2 "lustwzb11 HCA-1"
> Ca      : 0xf452140300163c70 ports 2 "lustwzb10 HCA-1"
> Ca      : 0xf452140300163e30 ports 2 "lustwzb8 HCA-1"
> Ca      : 0xf452140300163b80 ports 2 "lustwzb7 HCA-1"
> Ca      : 0xf452140300163ba0 ports 2 "lustwzb6 HCA-1"
> Ca      : 0xf45214030015bfb0 ports 2 "lustwzb5 HCA-1"
> Ca      : 0xf45214030015bf90 ports 2 "lustwzb3 HCA-1"
> Ca      : 0xf452140300163df0 ports 2 "lustwzb2 HCA-1"
> Ca      : 0xf45214030015c0a0 ports 2 "lustwzb1 HCA-1"
> Ca      : 0x0002c90300b78240 ports 1 "lustwz99 HCA-1"
> Ca      : 0xf452140300163b70 ports 2 "lustwzb4 HCA-1"
> 
> [hussaif1 at lustwzb4 test]$ ibnetdiscover
> #
> # Topology file: generated on Wed Aug  2 13:24:40 2017
> #
> # Initiated from node f452140300163b70 port f452140300163b71
> 
> vendid=0x2c9
> devid=0xc738
> sysimgguid=0x2c9030089cab0
> switchguid=0x2c9030089cab0(2c9030089cab0)
> Switch  32 "S-0002c9030089cab0"         # "SwitchX -  Mellanox Technologies" base port 0 lid 2 lmc 0
> [16]    "H-0002c90300b78240"[1](2c90300b78241)          # "lustwz99 HCA-1" lid 1 4xFDR10
> [17]    "H-f45214030015c0a0"[1](f45214030015c0a1)               # "lustwzb1 HCA-1" lid 5 4xFDR10
> [18]    "H-f452140300163df0"[1](f452140300163df1)               # "lustwzb2 HCA-1" lid 6 4xFDR10
> [19]    "H-f45214030015bf90"[1](f45214030015bf91)               # "lustwzb3 HCA-1" lid 4 4xFDR10
> [20]    "H-f452140300163b70"[1](f452140300163b71)               # "lustwzb4 HCA-1" lid 3 4xFDR10
> [21]    "H-f45214030015bfb0"[1](f45214030015bfb1)               # "lustwzb5 HCA-1" lid 7 4xFDR10
> [22]    "H-f452140300163ba0"[1](f452140300163ba1)               # "lustwzb6 HCA-1" lid 8 4xFDR10
> [23]    "H-f452140300163b80"[1](f452140300163b81)               # "lustwzb7 HCA-1" lid 9 4xFDR10
> [24]    "H-f452140300163e30"[1](f452140300163e31)               # "lustwzb8 HCA-1" lid 10 4xFDR10
> [25]    "H-f45214030015bf60"[1](f45214030015bf61)               # "lustwzb9 HCA-1" lid 11 4xFDR10
> [26]    "H-f452140300163c70"[1](f452140300163c71)               # "lustwzb10 HCA-1" lid 12 4xFDR10
> [27]    "H-f452140300163bb0"[1](f452140300163bb1)               # "lustwzb11 HCA-1" lid 13 4xFDR10
> [28]    "H-f45214030015bf70"[1](f45214030015bf71)               # "lustwzb12 HCA-1" lid 14 4xFDR10
> [29]    "H-f45214030015c290"[1](f45214030015c291)               # "lustwzb13 HCA-1" lid 15 4xFDR10
> [30]    "H-f45214030015c080"[1](f45214030015c081)               # "lustwzb14 HCA-1" lid 16 4xFDR10
> [31]    "H-f452140300163e20"[1](f452140300163e21)               # "lustwzb15 HCA-1" lid 17 4xFDR10
> [32]    "H-f45214030015c0e0"[1](f45214030015c0e1)               # "lustwzb16 HCA-1" lid 18 4xFDR10
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf45214030015c0e3
> caguid=0xf45214030015c0e0
> Ca      2 "H-f45214030015c0e0"          # "lustwzb16 HCA-1"
> [1](f45214030015c0e1)   "S-0002c9030089cab0"[32]                # lid 18 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf452140300163e23
> caguid=0xf452140300163e20
> Ca      2 "H-f452140300163e20"          # "lustwzb15 HCA-1"
> [1](f452140300163e21)   "S-0002c9030089cab0"[31]                # lid 17 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf45214030015c083
> caguid=0xf45214030015c080
> Ca      2 "H-f45214030015c080"          # "lustwzb14 HCA-1"
> [1](f45214030015c081)   "S-0002c9030089cab0"[30]                # lid 16 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf45214030015bf73
> caguid=0xf45214030015bf70
> Ca      2 "H-f45214030015bf70"          # "lustwzb12 HCA-1"
> [1](f45214030015bf71)   "S-0002c9030089cab0"[28]                # lid 14 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf45214030015c293
> caguid=0xf45214030015c290
> Ca      2 "H-f45214030015c290"          # "lustwzb13 HCA-1"
> [1](f45214030015c291)   "S-0002c9030089cab0"[29]                # lid 15 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf45214030015bf63
> caguid=0xf45214030015bf60
> Ca      2 "H-f45214030015bf60"          # "lustwzb9 HCA-1"
> [1](f45214030015bf61)   "S-0002c9030089cab0"[25]                # lid 11 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf452140300163bb3
> caguid=0xf452140300163bb0
> Ca      2 "H-f452140300163bb0"          # "lustwzb11 HCA-1"
> [1](f452140300163bb1)   "S-0002c9030089cab0"[27]                # lid 13 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf452140300163c73
> caguid=0xf452140300163c70
> Ca      2 "H-f452140300163c70"          # "lustwzb10 HCA-1"
> [1](f452140300163c71)   "S-0002c9030089cab0"[26]                # lid 12 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf452140300163e33
> caguid=0xf452140300163e30
> Ca      2 "H-f452140300163e30"          # "lustwzb8 HCA-1"
> [1](f452140300163e31)   "S-0002c9030089cab0"[24]                # lid 10 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf452140300163b83
> caguid=0xf452140300163b80
> Ca      2 "H-f452140300163b80"          # "lustwzb7 HCA-1"
> [1](f452140300163b81)   "S-0002c9030089cab0"[23]                # lid 9 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf45214030015bfb3
> caguid=0xf45214030015bfb0
> Ca      2 "H-f45214030015bfb0"          # "lustwzb5 HCA-1"
> [1](f45214030015bfb1)   "S-0002c9030089cab0"[21]                # lid 7 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf452140300163ba3
> caguid=0xf452140300163ba0
> Ca      2 "H-f452140300163ba0"          # "lustwzb6 HCA-1"
> [1](f452140300163ba1)   "S-0002c9030089cab0"[22]                # lid 8 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf452140300163df3
> caguid=0xf452140300163df0
> Ca      2 "H-f452140300163df0"          # "lustwzb2 HCA-1"
> [1](f452140300163df1)   "S-0002c9030089cab0"[18]                # lid 6 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf45214030015bf93
> caguid=0xf45214030015bf90
> Ca      2 "H-f45214030015bf90"          # "lustwzb3 HCA-1"
> [1](f45214030015bf91)   "S-0002c9030089cab0"[19]                # lid 4 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf45214030015c0a3
> caguid=0xf45214030015c0a0
> Ca      2 "H-f45214030015c0a0"          # "lustwzb1 HCA-1"
> [1](f45214030015c0a1)   "S-0002c9030089cab0"[17]                # lid 5 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0x2c90300b78243
> caguid=0x2c90300b78240
> Ca      1 "H-0002c90300b78240"          # "lustwz99 HCA-1"
> [1](2c90300b78241)      "S-0002c9030089cab0"[16]                # lid 1 lmc 0 "SwitchX -  Mellanox Technologies" lid 2 4xFDR10
> 
> vendid=0x2c9
> devid=0x1003
> sysimgguid=0xf452140300163b73
> caguid=0xf452140300163b70
> Ca      2 "H-f452140300163b70"          # "lustwzb4 HCA-1"
> [1](f452140300163b71)   "S-0002c9030089cab0"[20]
> 
> Quoting Gus Correa <gus at ldeo.columbia.edu>:
> 
>> Hi Faraz
>>
>> 1) lsmod | grep ib should show if the InfiniBand kernel modules are 
>> loaded.
>>
>> 2) InfiniBand normally uses remote DMA (rdma) through "verbs".
>> You should see an "ib" module with "verbs" in the name.
>> That is the preferred/faster mode for MPI.
>>
>> 3) However, you can also use InfiniBand for TCP/IP (slower).
>> As the output of your ifconfig shows, your ib0 interface is
>> also configured for TCP/IP.
>>
>> 4) You may have two interfaces (one card with two or two cards) in the 
>> nodes. One may not be connected to a switch (ib1). Check the back of 
>> your nodes.
>>
>> 5) To check if MPI is using it, depends a bit on which MPI library
>> you're using.
>> Which one? Open MPI, MVAPICH2, some vendor/proprietary one?
>> If it is Open MPI the command "ompi_info" will tell.
>> With Open MPI there are also ways to enable/disable
>> InfiniBand at runtime.
>>
>> 6) Some InfiniBand diagnostics may also help (normally in /usr/sbin)
>>
>> ibstat
>> ibhosts
>> ibnetdiscover
>>
>> etc
>>
>> OK, this is my pedestrian view of InfiniBand.
>> Now let's hear the experts in the list for deeper insights. :)
>>
>> I hope this helps,
>> Gus Correa
>>
>>
>> On 08/02/2017 12:44 PM, Faraz Hussain wrote:
>>> I have inherited a 20-node cluster that supposedly has an InfiniBand 
>>> network. I am testing some MPI applications and am seeing no 
>>> performance improvement with multiple nodes. So I am wondering if the 
>>> InfiniBand network even works?
>>>
>>> The output of ifconfig -a shows an ib0 and ib1 network. I ran 
>>> ethtool ib0 and it shows:
>>>
>>>         Speed: 40000Mb/s
>>>         Link detected: no
>>>
>>> and for ib1 it shows:
>>>
>>>         Speed: 10000Mb/s
>>>         Link detected: no
>>>
>>> I am assuming this means it is down? Any idea how to debug further 
>>> and restart it?
>>>
>>> Thanks!
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>
>>
> 
> 
> 
> 


