[Beowulf] IB problem/using IB diagnostics

Prentice Bisbal prentice at ias.edu
Thu Jun 18 10:48:16 PDT 2009


Joe Landman wrote:
> Prentice Bisbal wrote:
>> One of my nodes has an IB problem:
>>
>> --------------------------------------------------------------------------
>>
>> WARNING: There is at least on IB HCA found on host 'node36.aurora', but
>> there is no active ports detected. This is most certainly not what you
>> wanted. Check your cables and SM configuration.
>> --------------------------------------------------------------------------
>>
>>
>>
>> I've been trying to use some of the IB diagnostics tools to diagnose the
>> problem, but I can't seem to figure out the correct syntax of the
>> commands. On a node that I know is working correctly (the master node):
>>
> 
> Sanity check:
> 
>   ibnodes or ibhosts report what?
> 
>   This will test the connection, subnet manager, and a few other
> things.  If you get nothing back, then things are problematic.
> 
>   ibdiagnet is also useful for status of the subnets.
> 

The output of these 3 commands is below. I don't see anything really
unusual. Sorry for the long e-mail, and thanks for the help.

[root@aurora ~]# ibnodes
Ca      : 0x0005ad00001ffb70 ports 2 "node64 HCA-1"
Ca      : 0x0005ad00001ffaec ports 2 "node63 HCA-1"
Ca      : 0x0005ad00001ffb1c ports 2 "node62 HCA-1"
Ca      : 0x0005ad00001ffab8 ports 2 "node61 HCA-1"
Ca      : 0x0005ad00001ffa90 ports 2 "node60 HCA-1"
Ca      : 0x0005ad00001ffae0 ports 2 "node59 HCA-1"
Ca      : 0x0005ad00001ffb30 ports 2 "node58 HCA-1"
Ca      : 0x0005ad00001ffba4 ports 2 "node57 HCA-1"
Ca      : 0x0005ad00001ffb2c ports 2 "node56 HCA-1"
Ca      : 0x0005ad00001ffb74 ports 2 "node55 HCA-1"
Ca      : 0x0005ad00001ff990 ports 2 "node54 HCA-1"
Ca      : 0x0005ad00001ffb58 ports 2 "node53 HCA-1"
Ca      : 0x0005ad00001ff964 ports 2 "node52 HCA-1"
Ca      : 0x0005ad00001ff99c ports 2 "node51 HCA-1"
Ca      : 0x0005ad00001ffb54 ports 2 "node50 HCA-1"
Ca      : 0x0005ad00001ffb48 ports 2 "node49 HCA-1"
Ca      : 0x0005ad00001ff9a0 ports 2 "node48 HCA-1"
Ca      : 0x0005ad00001ffb80 ports 2 "node47 HCA-1"
Ca      : 0x0005ad00001ff9a4 ports 2 "node46 HCA-1"
Ca      : 0x0005ad00001ffb44 ports 2 "node45 HCA-1"
Ca      : 0x0005ad00001ffb94 ports 2 "node44 HCA-1"
Ca      : 0x0005ad00001ffaf8 ports 2 "node43 HCA-1"
Ca      : 0x0005ad00001ffb40 ports 2 "node42 HCA-1"
Ca      : 0x0005ad00001ffb24 ports 2 "node41 HCA-1"
Ca      : 0x0005ad00001ffbc4 ports 2 "node40 HCA-1"
Ca      : 0x0005ad00001ff968 ports 2 "node39 HCA-1"
Ca      : 0x0005ad00001ffb6c ports 2 "node38 HCA-1"
Ca      : 0x0005ad00001ffaa4 ports 2 "node37 HCA-1"
Ca      : 0x0005ad00001ff970 ports 2 "node36 HCA-1"
Ca      : 0x0005ad00001ffaa8 ports 2 "node35 HCA-1"
Ca      : 0x0005ad00001ffaac ports 2 "node34 HCA-1"
Ca      : 0x0005ad00001ffb20 ports 2 "node33 HCA-1"
Ca      : 0x0005ad00001ffa34 ports 2 "node12 HCA-1"
Ca      : 0x0005ad00001ffa68 ports 2 "node11 HCA-1"
Ca      : 0x0005ad00001ff978 ports 2 "node10 HCA-1"
Ca      : 0x0005ad00001ff974 ports 2 "node09 HCA-1"
Ca      : 0x0005ad00001ffb34 ports 2 "node08 HCA-1"
Ca      : 0x0005ad00001ffa48 ports 2 "node07 HCA-1"
Ca      : 0x0005ad00001ff98c ports 2 "node06 HCA-1"
Ca      : 0x0005ad00001ff994 ports 2 "node05 HCA-1"
Ca      : 0x0005ad00001ff988 ports 2 "node04 HCA-1"
Ca      : 0x0005ad00001ffa3c ports 2 "node03 HCA-1"
Ca      : 0x0005ad00001ff9ac ports 2 "node02 HCA-1"
Ca      : 0x0005ad00001ffaa0 ports 2 "node01 HCA-1"
Ca      : 0x0005ad00001ff96c ports 2 "node24 HCA-1"
Ca      : 0x0005ad00001ffb60 ports 2 "node23 HCA-1"
Ca      : 0x0005ad00001ffa38 ports 2 "node22 HCA-1"
Ca      : 0x0005ad00001ff9a8 ports 2 "node21 HCA-1"
Ca      : 0x0005ad00001ffb4c ports 2 "node20 HCA-1"
Ca      : 0x0005ad00001ffa40 ports 2 "node19 HCA-1"
Ca      : 0x0005ad00001ff998 ports 2 "node18 HCA-1"
Ca      : 0x0005ad00001ffbb8 ports 2 "node17 HCA-1"
Ca      : 0x0005ad00001ffa9c ports 2 "node16 HCA-1"
Ca      : 0x0005ad00001ffae8 ports 2 "node15 HCA-1"
Ca      : 0x0005ad00001ffb3c ports 2 "node14 HCA-1"
Ca      : 0x0005ad00001ff984 ports 2 "node13 HCA-1"
Ca      : 0x0005ad00001ffbc0 ports 2 "node32 HCA-1"
Ca      : 0x0005ad00001ff97c ports 2 "node31 HCA-1"
Ca      : 0x0005ad00001ff980 ports 2 "node30 HCA-1"
Ca      : 0x0005ad00001ffab0 ports 2 "node29 HCA-1"
Ca      : 0x0005ad00001ffafc ports 2 "node28 HCA-1"
Ca      : 0x0005ad00001ffb64 ports 2 "node27 HCA-1"
Ca      : 0x0005ad00001ff9b8 ports 2 "node26 HCA-1"
Ca      : 0x0005ad00001ffb50 ports 2 "node25 HCA-1"
Ca      : 0x0005ad00001ffb28 ports 2 "aurora HCA-1"
Switch  : 0x0005ad00070435e2 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 6, Chip A" base port 0 lid 3 lmc 0
Switch  : 0x0005ad10060434a1 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 1, Chip B" base port 0 lid 9 lmc 0
Switch  : 0x0005ad100604375e ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 3, Chip B" base port 0 lid 8 lmc 0
Switch  : 0x0005ad100604351a ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 2, Chip B" base port 0 lid 7 lmc 0
Switch  : 0x0005ad00060434a1 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 1, Chip A" enhanced port 0 lid 6 lmc 0
Switch  : 0x0005ad000604375e ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 3, Chip A" base port 0 lid 5 lmc 0
Switch  : 0x0005ad000604351a ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 2, Chip A" enhanced port 0 lid 4 lmc 0
Switch  : 0x0005ad000704348e ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 4, Chip A" base port 0 lid 14 lmc 0
Switch  : 0x0005ad000704348c ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 2, Chip A" base port 0 lid 13 lmc 0
Switch  : 0x0005ad000704348a ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 1, Chip A" base port 0 lid 11 lmc 0
Switch  : 0x0005ad0007043489 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 3, Chip A" base port 0 lid 10 lmc 0
Switch  : 0x0005ad00070435d4 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 5, Chip A" base port 0 lid 12 lmc 0

[root@aurora ~]# ibhosts
Ca      : 0x0005ad00001ffb70 ports 2 "node64 HCA-1"
Ca      : 0x0005ad00001ffaec ports 2 "node63 HCA-1"
Ca      : 0x0005ad00001ffb1c ports 2 "node62 HCA-1"
Ca      : 0x0005ad00001ffab8 ports 2 "node61 HCA-1"
Ca      : 0x0005ad00001ffa90 ports 2 "node60 HCA-1"
Ca      : 0x0005ad00001ffae0 ports 2 "node59 HCA-1"
Ca      : 0x0005ad00001ffb30 ports 2 "node58 HCA-1"
Ca      : 0x0005ad00001ffba4 ports 2 "node57 HCA-1"
Ca      : 0x0005ad00001ffb2c ports 2 "node56 HCA-1"
Ca      : 0x0005ad00001ffb74 ports 2 "node55 HCA-1"
Ca      : 0x0005ad00001ff990 ports 2 "node54 HCA-1"
Ca      : 0x0005ad00001ffb58 ports 2 "node53 HCA-1"
Ca      : 0x0005ad00001ff964 ports 2 "node52 HCA-1"
Ca      : 0x0005ad00001ff99c ports 2 "node51 HCA-1"
Ca      : 0x0005ad00001ffb54 ports 2 "node50 HCA-1"
Ca      : 0x0005ad00001ffb48 ports 2 "node49 HCA-1"
Ca      : 0x0005ad00001ff9a0 ports 2 "node48 HCA-1"
Ca      : 0x0005ad00001ffb80 ports 2 "node47 HCA-1"
Ca      : 0x0005ad00001ff9a4 ports 2 "node46 HCA-1"
Ca      : 0x0005ad00001ffb44 ports 2 "node45 HCA-1"
Ca      : 0x0005ad00001ffb94 ports 2 "node44 HCA-1"
Ca      : 0x0005ad00001ffaf8 ports 2 "node43 HCA-1"
Ca      : 0x0005ad00001ffb40 ports 2 "node42 HCA-1"
Ca      : 0x0005ad00001ffb24 ports 2 "node41 HCA-1"
Ca      : 0x0005ad00001ffbc4 ports 2 "node40 HCA-1"
Ca      : 0x0005ad00001ff968 ports 2 "node39 HCA-1"
Ca      : 0x0005ad00001ffb6c ports 2 "node38 HCA-1"
Ca      : 0x0005ad00001ffaa4 ports 2 "node37 HCA-1"
Ca      : 0x0005ad00001ff970 ports 2 "node36 HCA-1"
Ca      : 0x0005ad00001ffaa8 ports 2 "node35 HCA-1"
Ca      : 0x0005ad00001ffaac ports 2 "node34 HCA-1"
Ca      : 0x0005ad00001ffb20 ports 2 "node33 HCA-1"
Ca      : 0x0005ad00001ffa34 ports 2 "node12 HCA-1"
Ca      : 0x0005ad00001ffa68 ports 2 "node11 HCA-1"
Ca      : 0x0005ad00001ff978 ports 2 "node10 HCA-1"
Ca      : 0x0005ad00001ff974 ports 2 "node09 HCA-1"
Ca      : 0x0005ad00001ffb34 ports 2 "node08 HCA-1"
Ca      : 0x0005ad00001ffa48 ports 2 "node07 HCA-1"
Ca      : 0x0005ad00001ff98c ports 2 "node06 HCA-1"
Ca      : 0x0005ad00001ff994 ports 2 "node05 HCA-1"
Ca      : 0x0005ad00001ff988 ports 2 "node04 HCA-1"
Ca      : 0x0005ad00001ffa3c ports 2 "node03 HCA-1"
Ca      : 0x0005ad00001ff9ac ports 2 "node02 HCA-1"
Ca      : 0x0005ad00001ffaa0 ports 2 "node01 HCA-1"
Ca      : 0x0005ad00001ff96c ports 2 "node24 HCA-1"
Ca      : 0x0005ad00001ffb60 ports 2 "node23 HCA-1"
Ca      : 0x0005ad00001ffa38 ports 2 "node22 HCA-1"
Ca      : 0x0005ad00001ff9a8 ports 2 "node21 HCA-1"
Ca      : 0x0005ad00001ffb4c ports 2 "node20 HCA-1"
Ca      : 0x0005ad00001ffa40 ports 2 "node19 HCA-1"
Ca      : 0x0005ad00001ff998 ports 2 "node18 HCA-1"
Ca      : 0x0005ad00001ffbb8 ports 2 "node17 HCA-1"
Ca      : 0x0005ad00001ffa9c ports 2 "node16 HCA-1"
Ca      : 0x0005ad00001ffae8 ports 2 "node15 HCA-1"
Ca      : 0x0005ad00001ffb3c ports 2 "node14 HCA-1"
Ca      : 0x0005ad00001ff984 ports 2 "node13 HCA-1"
Ca      : 0x0005ad00001ffbc0 ports 2 "node32 HCA-1"
Ca      : 0x0005ad00001ff97c ports 2 "node31 HCA-1"
Ca      : 0x0005ad00001ff980 ports 2 "node30 HCA-1"
Ca      : 0x0005ad00001ffab0 ports 2 "node29 HCA-1"
Ca      : 0x0005ad00001ffafc ports 2 "node28 HCA-1"
Ca      : 0x0005ad00001ffb64 ports 2 "node27 HCA-1"
Ca      : 0x0005ad00001ff9b8 ports 2 "node26 HCA-1"
Ca      : 0x0005ad00001ffb50 ports 2 "node25 HCA-1"
Ca      : 0x0005ad00001ffb28 ports 2 "aurora HCA-1"
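
Listings this long are easy to misread by eye. Note that node36 does
appear in both listings above. As a minimal sketch (not part of the IB
tools), the "Ca" lines can be parsed in Python to confirm that a given
host is listed. The names parse_ca_lines and CA_LINE are made up for
this example, and the sample is just three lines copied from the output
above; in practice the text would come from a saved copy of the ibhosts
output.

```python
import re

# Matches lines such as:
#   Ca      : 0x0005ad00001ff970 ports 2 "node36 HCA-1"
CA_LINE = re.compile(r'^Ca\s*:\s*(0x[0-9a-fA-F]+)\s+ports\s+(\d+)\s+"([^"]*)"')


def parse_ca_lines(text):
    """Return {description: (node_guid, port_count)} for every Ca line."""
    cas = {}
    for line in text.splitlines():
        m = CA_LINE.match(line.strip())
        if m:
            guid, ports, desc = m.groups()
            cas[desc] = (guid, int(ports))
    return cas


# Three lines copied from the ibhosts output above (illustrative sample).
sample = '''\
Ca      : 0x0005ad00001ffaa4 ports 2 "node37 HCA-1"
Ca      : 0x0005ad00001ff970 ports 2 "node36 HCA-1"
Ca      : 0x0005ad00001ffb28 ports 2 "aurora HCA-1"
'''

cas = parse_ca_lines(sample)
# None here would mean the host is missing from the listing.
print(cas.get("node36 HCA-1"))
```

Being listed only shows that the HCA was discovered from the master
node; it says nothing about the state of the individual ports.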

[root@aurora ~]# ibdiagnet
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib64/ibdm1.2
-I- Using port 1 as the local port.
-I- Discovering ... 77 nodes (12 Switches & 65 CA-s) discovered.


-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- No bad Guids were found

-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found

-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=11
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=12
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0039 guid=0x0005ad00001ffb6d dev=25208 node38/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=19
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=20
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=21
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=23
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0015 guid=0x0005ad00001ffb3d dev=25208 node14/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x004c guid=0x0005ad00001ffab1 dev=25208 node29/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=15
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=16
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=19
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=23
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x001e guid=0x0005ad00001ffb51 dev=25208 node25/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0002 guid=0x0005ad00001ffaa1 dev=25208 node01/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x001b guid=0x0005ad00001ffaad dev=25208 node34/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0020 guid=0x0005ad00001ffa69 dev=25208 node11/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0019 guid=0x0005ad00001ff989 dev=25208 node04/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=19
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=20
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=21
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=23
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0016 guid=0x0005ad00001ffb35 dev=25208 node08/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x002c guid=0x0005ad00001ff995 dev=25208 node05/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0010 guid=0x0005ad00001ff98d dev=25208 node06/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x004a guid=0x0005ad00001ff9a5 dev=25208 node46/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0027 guid=0x0005ad00001ff9b9 dev=25208 node26/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=7
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=8
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=9
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x001f guid=0x0005ad00001ffa35 dev=25208 node12/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0009 guid=0x0005ad10060434a1 dev=47396 Port=19
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0009 guid=0x0005ad10060434a1 dev=47396 Port=20
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0045 guid=0x0005ad00001ffaf9 dev=25208 node43/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x003d guid=0x0005ad00001ffafd dev=25208 node28/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0008 guid=0x0005ad100604375e dev=47396 Port=13
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0008 guid=0x0005ad100604375e dev=47396 Port=20
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0014 guid=0x0005ad00001ff979 dev=25208 node10/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0035 guid=0x0005ad00001ffb31 dev=25208 node58/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x002a guid=0x0005ad00001ffa9d dev=25208 node16/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0028 guid=0x0005ad00001ffa39 dev=25208 node22/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=19
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=20
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=21
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=23
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0001 guid=0x0005ad00001ffb29 dev=25208 aurora/P1
      Performance Monitor counter     : Value
      vl15_dropped                    : 0xffff (overflow)
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0043 guid=0x0005ad00001ffb81 dev=25208 node47/P1
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)

-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------
-I-    PKey:0x7fff Hosts:65 full:65 partial:0

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps

-I---------------------------------------------------
-I- Bad Links Info
-I- No bad link were found
-I---------------------------------------------------
----------------------------------------------------------------
-I- Stages Status Report:
    STAGE                                    Errors Warnings
    Bad GUIDs/LIDs Check                     0      0
    Link State Active Check                  0      0
    Performance Counters Report              0      47
    Partitions Check                         0      0
    IPoIB Subnets Check                      0      1

Please see /tmp/ibdiagnet.log for complete log
----------------------------------------------------------------

-I- Done. Run time was 5 seconds.
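
The PM Counters section is long enough that a summary helps. The sketch
below, in Python, groups the counters reported at 0xffff by the port
that reported them. It assumes the report has exactly the layout shown
above (a "-W- lid=... guid=... dev=..." header followed by indented
"name : value" lines); the function name overflowed_counters and the
two patterns are invented for this example. A counter stuck at 0xffff
only shows that it reached its maximum since it was last cleared, not
when the errors happened. As far as I can see, node36 itself is not
among the ports flagged above.

```python
import re
from collections import defaultdict

# Header line, e.g.:
#   -W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=11
HEADER = re.compile(r'^-W- lid=(0x[0-9a-f]+) guid=(0x[0-9a-f]+) dev=(\d+) (\S+)')
# Counter line, e.g.:
#   symbol_error_counter            : 0xffff (overflow)
COUNTER = re.compile(r'^\s+(\w+)\s*:\s*(0x[0-9a-f]+)')


def overflowed_counters(text):
    """Map (lid, port label) to the names of counters reported as 0xffff."""
    result = defaultdict(list)
    current = None
    for line in text.splitlines():
        h = HEADER.match(line)
        if h:
            current = (h.group(1), h.group(4))
            continue
        c = COUNTER.match(line)
        if c and current is not None and c.group(2) == "0xffff":
            result[current].append(c.group(1))
    return dict(result)


# Two entries copied from the ibdiagnet output above (illustrative sample).
sample = '''\
-W- lid=0x0001 guid=0x0005ad00001ffb29 dev=25208 aurora/P1
      Performance Monitor counter     : Value
      vl15_dropped                    : 0xffff (overflow)
      symbol_error_counter            : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=11
      Performance Monitor counter     : Value
      symbol_error_counter            : 0xffff (overflow)
'''

report = overflowed_counters(sample)
for (lid, port), names in sorted(report.items()):
    print(lid, port, ", ".join(names))
```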




-- 
Prentice


