<div dir="ltr">Faraz,<div><br></div><div>You can test your point to point rdma bandwidth as well.</div><div><br></div><div>On host lustwz99 run `qperf`</div><div>On any of the hosts lustwzb1-16 run `qperf lustwz99 -t 30 rc_lat rc_bi_bw`</div><div><br></div><div>Establish that you can pass traffic at expected speeds before going to the ipoib portion. </div><div><br></div><div>Also make sure that all of your node are running in the same mode, connected or datagram and that your MTU is the same on all nodes for that IP interface.</div><div><br></div><div>--Jeff</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Aug 2, 2017 at 10:50 AM, Faraz Hussain <span dir="ltr"><<a href="mailto:info@feacluster.com" target="_blank">info@feacluster.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Thanks Joe. Here is the output from the commands you suggested. We have open mpi built from Intel mpi compiler. Is there some benchmark code I can compile so that we are all comparing the same code?<br>
<br>
[hussaif1@lustwzb4 test]$ ibv_devinfo<br>
hca_id: mlx4_0<br>
        transport:                      InfiniBand (0)<br>
        fw_ver:                         2.11.550<br>
        node_guid:                      f452:1403:0016:3b70<br>
        sys_image_guid:                 f452:1403:0016:3b73<br>
        vendor_id:                      0x02c9<br>
        vendor_part_id:                 4099<br>
        hw_ver:                         0x0<br>
        board_id:                       DEL0A40000028<br>
        phys_port_cnt:                  2<br>
                port:   1<br>
                        state:                  PORT_ACTIVE (4)<br>
                        max_mtu:                4096 (5)<br>
                        active_mtu:             4096 (5)<br>
                        sm_lid:                 1<br>
                        port_lid:               3<br>
                        port_lmc:               0x00<br>
                        link_layer:             InfiniBand<br>
<br>
                port:   2<br>
                        state:                  PORT_DOWN (1)<br>
                        max_mtu:                4096 (5)<br>
                        active_mtu:             4096 (5)<br>
                        sm_lid:                 0<br>
                        port_lid:               0<br>
                        port_lmc:               0x00<br>
                        link_layer:             InfiniBand<br>
<br>
[hussaif1@lustwzb4 test]$ ibstat<br>
CA 'mlx4_0'<br>
        CA type: MT4099<br>
        Number of ports: 2<br>
        Firmware version: 2.11.550<br>
        Hardware version: 0<br>
        Node GUID: 0xf452140300163b70<br>
        System image GUID: 0xf452140300163b73<br>
        Port 1:<br>
                State: Active<br>
                Physical state: LinkUp<br>
                Rate: 40 (FDR10)<br>
                Base lid: 3<br>
                LMC: 0<br>
                SM lid: 1<br>
                Capability mask: 0x02514868<br>
                Port GUID: 0xf452140300163b71<br>
                Link layer: InfiniBand<br>
        Port 2:<br>
                State: Down<br>
                Physical state: Disabled<br>
                Rate: 10<br>
                Base lid: 0<br>
                LMC: 0<br>
                SM lid: 0<br>
                Capability mask: 0x02514868<br>
                Port GUID: 0xf452140300163b72<br>
                Link layer: InfiniBand<br>
<br>
[hussaif1@lustwzb4 test]$ ibstatus<br>
Infiniband device 'mlx4_0' port 1 status:<br>
        default gid:     fe80:0000:0000:0000:f452:1403:0016:3b71<br>
        base lid:        0x3<br>
        sm lid:          0x1<br>
        state:           4: ACTIVE<br>
        phys state:      5: LinkUp<br>
        rate:            40 Gb/sec (4X FDR10)<br>
        link_layer:      InfiniBand<br>
<br>
Infiniband device 'mlx4_0' port 2 status:<br>
        default gid:     fe80:0000:0000:0000:f452:1403:0016:3b72<br>
        base lid:        0x0<br>
        sm lid:          0x0<br>
        state:           1: DOWN<br>
        phys state:      3: Disabled<br>
        rate:            10 Gb/sec (4X)<br>
        link_layer:      InfiniBand<br>
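<br>
For reference, the IPoIB mode (connected vs. datagram) and the MTU can be compared across nodes with something like the commands below; ib0 is assumed to be the IPoIB interface, and pdsh is only one way of fanning the check out.<br>
<br>
# connected or datagram?<br>
cat /sys/class/net/ib0/mode<br>
# current MTU of the IPoIB interface<br>
ip link show ib0 | grep -o 'mtu [0-9]*'<br>
# the same check on every node, if pdsh is available<br>
pdsh -w lustwzb[1-16] 'cat /sys/class/net/ib0/mode; ip link show ib0 | grep -o "mtu [0-9]*"'<br>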
<br>
<br>
Quoting Joe Landman <<a href="mailto:joe.landman@gmail.com" target="_blank">joe.landman@gmail.com</a>>:<br>
<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
start with<br>
<br>
    ibv_devinfo<br>
<br>
    ibstat<br>
<br>
    ibstatus<br>
<br>
<br>
and see what (if anything) they report.<br>
<br>
Second, how did you compile/run your MPI code?<br>
<br>
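If it happens to be Open MPI, for example, something like the following would show whether the build even contains the InfiniBand (openib) BTL and would force it at run time; the hostfile and binary names here are only placeholders.<br>
<br>
# is the openib BTL present in this Open MPI build?<br>
ompi_info | grep -i openib<br>
# force the InfiniBand transport (vader is the shared-memory BTL for on-node ranks)<br>
mpirun --mca btl openib,self,vader -np 32 -hostfile ./hosts ./your_mpi_app<br>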
<br>
On 08/02/2017 12:44 PM, Faraz Hussain wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
I have inherited a 20-node cluster that supposedly has an InfiniBand network. I am testing some MPI applications and am seeing no performance improvement with multiple nodes. So I am wondering if the InfiniBand network even works?<br>
<br>
The output of ifconfig -a shows ib0 and ib1 interfaces. I ran ethtool ib0 and it shows:<br>
<br>
       Speed: 40000Mb/s<br>
       Link detected: no<br>
<br>
and for ib1 it shows:<br>
<br>
       Speed: 10000Mb/s<br>
       Link detected: no<br>
<br>
I am assuming this means it is down? Any idea how to debug further and restart it?<br>
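<br>
Beyond ethtool, I assume the first checks would be something along these lines (service and interface names are guesses and vary by distribution):<br>
<br>
ibstat                    # port state, rate and link layer at the verbs level<br>
sminfo                    # is a subnet manager answering on the fabric?<br>
systemctl status rdma     # RDMA stack service on RHEL/CentOS 7<br>
ip link set ib0 up        # bring the IPoIB interface up if it is administratively down<br>
ip addr show ib0<br>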
<br>
Thanks!<br>
<br>
</blockquote>
<br>
-- <br>
Joe Landman<br>
e: <a href="mailto:joe.landman@gmail.com" target="_blank">joe.landman@gmail.com</a><br>
t: @hpcjoe<br>
w: <a href="https://scalability.org" rel="noreferrer" target="_blank">https://scalability.org</a><br>
g: <a href="https://github.com/joelandman" rel="noreferrer" target="_blank">https://github.com/joelandman</a><br>
l: <a href="https://www.linkedin.com/in/joelandman" rel="noreferrer" target="_blank">https://www.linkedin.com/in/joelandman</a><br>
<br>
</blockquote>
<br>
<br>
<br>
_______________________________________________<br>
Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>
To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" rel="noreferrer" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>
</blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">------------------------------<br>Jeff Johnson<br>Co-Founder<br>Aeon Computing<br><br><a href="mailto:jeff.johnson@aeoncomputing.com" target="_blank">jeff.johnson@aeoncomputing.com</a><br><a href="http://www.aeoncomputing.com" target="_blank">www.aeoncomputing.com</a><br>t: 858-412-3810 x1001   f: 858-412-3845<br>m: 619-204-9061<br><br>4170 Morena Boulevard, Suite D - San Diego, CA 92117<div><br></div><div>High-Performance Computing / Lustre Filesystems / Scale-out Storage</div></div></div>
</div>