[Beowulf] IB problem/using IB diagnostics
Prentice Bisbal prentice at ias.edu
Thu Jun 18 10:48:16 PDT 2009
Joe Landman wrote:
> Prentice Bisbal wrote:
>> One of my nodes has an IB problem:
>>
>> --------------------------------------------------------------------------
>> WARNING: There is at least on IB HCA found on host 'node36.aurora', but
>> there is no active ports detected. This is most certainly not what you
>> wanted. Check your cables and SM configuration.
>> --------------------------------------------------------------------------
>>
>> I've been trying to use some of the IB diagnostics tools to diagnose the
>> problem, but I can't seem to figure out the correct syntax of the
>> commands. On a node that I know is working correctly (the master node):
>>
>
> Sanity check:
>
>    ibnodes or ibhosts report what?
>
> This will test the connection, session manager, and a few other
> things. If you get nothing back then things are problematic
>
> ibdiagnet is also useful for status of the subnets.
>

The output of these 3 commands is below. I don't see anything really
unusual. Sorry for the long e-mail, and thanks for the help.
[root@aurora ~]# ibnodes
Ca : 0x0005ad00001ffb70 ports 2 "node64 HCA-1"
Ca : 0x0005ad00001ffaec ports 2 "node63 HCA-1"
Ca : 0x0005ad00001ffb1c ports 2 "node62 HCA-1"
Ca : 0x0005ad00001ffab8 ports 2 "node61 HCA-1"
Ca : 0x0005ad00001ffa90 ports 2 "node60 HCA-1"
Ca : 0x0005ad00001ffae0 ports 2 "node59 HCA-1"
Ca : 0x0005ad00001ffb30 ports 2 "node58 HCA-1"
Ca : 0x0005ad00001ffba4 ports 2 "node57 HCA-1"
Ca : 0x0005ad00001ffb2c ports 2 "node56 HCA-1"
Ca : 0x0005ad00001ffb74 ports 2 "node55 HCA-1"
Ca : 0x0005ad00001ff990 ports 2 "node54 HCA-1"
Ca : 0x0005ad00001ffb58 ports 2 "node53 HCA-1"
Ca : 0x0005ad00001ff964 ports 2 "node52 HCA-1"
Ca : 0x0005ad00001ff99c ports 2 "node51 HCA-1"
Ca : 0x0005ad00001ffb54 ports 2 "node50 HCA-1"
Ca : 0x0005ad00001ffb48 ports 2 "node49 HCA-1"
Ca : 0x0005ad00001ff9a0 ports 2 "node48 HCA-1"
Ca : 0x0005ad00001ffb80 ports 2 "node47 HCA-1"
Ca : 0x0005ad00001ff9a4 ports 2 "node46 HCA-1"
Ca : 0x0005ad00001ffb44 ports 2 "node45 HCA-1"
Ca : 0x0005ad00001ffb94 ports 2 "node44 HCA-1"
Ca : 0x0005ad00001ffaf8 ports 2 "node43 HCA-1"
Ca : 0x0005ad00001ffb40 ports 2 "node42 HCA-1"
Ca : 0x0005ad00001ffb24 ports 2 "node41 HCA-1"
Ca : 0x0005ad00001ffbc4 ports 2 "node40 HCA-1"
Ca : 0x0005ad00001ff968 ports 2 "node39 HCA-1"
Ca : 0x0005ad00001ffb6c ports 2 "node38 HCA-1"
Ca : 0x0005ad00001ffaa4 ports 2 "node37 HCA-1"
Ca : 0x0005ad00001ff970 ports 2 "node36 HCA-1"
Ca : 0x0005ad00001ffaa8 ports 2 "node35 HCA-1"
Ca : 0x0005ad00001ffaac ports 2 "node34 HCA-1"
Ca : 0x0005ad00001ffb20 ports 2 "node33 HCA-1"
Ca : 0x0005ad00001ffa34 ports 2 "node12 HCA-1"
Ca : 0x0005ad00001ffa68 ports 2 "node11 HCA-1"
Ca : 0x0005ad00001ff978 ports 2 "node10 HCA-1"
Ca : 0x0005ad00001ff974 ports 2 "node09 HCA-1"
Ca : 0x0005ad00001ffb34 ports 2 "node08 HCA-1"
Ca : 0x0005ad00001ffa48 ports 2 "node07 HCA-1"
Ca : 0x0005ad00001ff98c ports 2 "node06 HCA-1"
Ca : 0x0005ad00001ff994 ports 2 "node05 HCA-1"
Ca : 0x0005ad00001ff988 ports 2 "node04 HCA-1"
Ca : 0x0005ad00001ffa3c ports 2 "node03 HCA-1"
Ca : 0x0005ad00001ff9ac ports 2 "node02 HCA-1"
Ca : 0x0005ad00001ffaa0 ports 2 "node01 HCA-1"
Ca : 0x0005ad00001ff96c ports 2 "node24 HCA-1"
Ca : 0x0005ad00001ffb60 ports 2 "node23 HCA-1"
Ca : 0x0005ad00001ffa38 ports 2 "node22 HCA-1"
Ca : 0x0005ad00001ff9a8 ports 2 "node21 HCA-1"
Ca : 0x0005ad00001ffb4c ports 2 "node20 HCA-1"
Ca : 0x0005ad00001ffa40 ports 2 "node19 HCA-1"
Ca : 0x0005ad00001ff998 ports 2 "node18 HCA-1"
Ca : 0x0005ad00001ffbb8 ports 2 "node17 HCA-1"
Ca : 0x0005ad00001ffa9c ports 2 "node16 HCA-1"
Ca : 0x0005ad00001ffae8 ports 2 "node15 HCA-1"
Ca : 0x0005ad00001ffb3c ports 2 "node14 HCA-1"
Ca : 0x0005ad00001ff984 ports 2 "node13 HCA-1"
Ca : 0x0005ad00001ffbc0 ports 2 "node32 HCA-1"
Ca : 0x0005ad00001ff97c ports 2 "node31 HCA-1"
Ca : 0x0005ad00001ff980 ports 2 "node30 HCA-1"
Ca : 0x0005ad00001ffab0 ports 2 "node29 HCA-1"
Ca : 0x0005ad00001ffafc ports 2 "node28 HCA-1"
Ca : 0x0005ad00001ffb64 ports 2 "node27 HCA-1"
Ca : 0x0005ad00001ff9b8 ports 2 "node26 HCA-1"
Ca : 0x0005ad00001ffb50 ports 2 "node25 HCA-1"
Ca : 0x0005ad00001ffb28 ports 2 "aurora HCA-1"
Switch : 0x0005ad00070435e2 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 6, Chip A" base port 0 lid 3 lmc 0
Switch : 0x0005ad10060434a1 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 1, Chip B" base port 0 lid 9 lmc 0
Switch : 0x0005ad100604375e ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 3, Chip B" base port 0 lid 8 lmc 0
Switch : 0x0005ad100604351a ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 2, Chip B" base port 0 lid 7 lmc 0
Switch : 0x0005ad00060434a1 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 1, Chip A" enhanced port 0 lid 6 lmc 0
Switch : 0x0005ad000604375e ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 3, Chip A" base port 0 lid 5 lmc 0
Switch : 0x0005ad000604351a ports 24 "SFS-7012 GUID=0x0005ad00020434bd Spine 2, Chip A" enhanced port 0 lid 4 lmc 0
Switch : 0x0005ad000704348e ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 4, Chip A" base port 0 lid 14 lmc 0
Switch : 0x0005ad000704348c ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 2, Chip A" base port 0 lid 13 lmc 0
Switch : 0x0005ad000704348a ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 1, Chip A" base port 0 lid 11 lmc 0
Switch : 0x0005ad0007043489 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 3, Chip A" base port 0 lid 10 lmc 0
Switch : 0x0005ad00070435d4 ports 24 "SFS-7012 GUID=0x0005ad00020434bd Leaf 5, Chip A" base port 0 lid 12 lmc 0

[root@aurora ~]# ibhosts
Ca : 0x0005ad00001ffb70 ports 2 "node64 HCA-1"
Ca : 0x0005ad00001ffaec ports 2 "node63 HCA-1"
Ca : 0x0005ad00001ffb1c ports 2 "node62 HCA-1"
Ca : 0x0005ad00001ffab8 ports 2 "node61 HCA-1"
Ca : 0x0005ad00001ffa90 ports 2 "node60 HCA-1"
Ca : 0x0005ad00001ffae0 ports 2 "node59 HCA-1"
Ca : 0x0005ad00001ffb30 ports 2 "node58 HCA-1"
Ca : 0x0005ad00001ffba4 ports 2 "node57 HCA-1"
Ca : 0x0005ad00001ffb2c ports 2 "node56 HCA-1"
Ca : 0x0005ad00001ffb74 ports 2 "node55 HCA-1"
Ca : 0x0005ad00001ff990 ports 2 "node54 HCA-1"
Ca : 0x0005ad00001ffb58 ports 2 "node53 HCA-1"
Ca : 0x0005ad00001ff964 ports 2 "node52 HCA-1"
Ca : 0x0005ad00001ff99c ports 2 "node51 HCA-1"
Ca : 0x0005ad00001ffb54 ports 2 "node50 HCA-1"
Ca : 0x0005ad00001ffb48 ports 2 "node49 HCA-1"
Ca : 0x0005ad00001ff9a0 ports 2 "node48 HCA-1"
Ca : 0x0005ad00001ffb80 ports 2 "node47 HCA-1"
Ca : 0x0005ad00001ff9a4 ports 2 "node46 HCA-1"
Ca : 0x0005ad00001ffb44 ports 2 "node45 HCA-1"
Ca : 0x0005ad00001ffb94 ports 2 "node44 HCA-1"
Ca : 0x0005ad00001ffaf8 ports 2 "node43 HCA-1"
Ca : 0x0005ad00001ffb40 ports 2 "node42 HCA-1"
Ca : 0x0005ad00001ffb24 ports 2 "node41 HCA-1"
Ca : 0x0005ad00001ffbc4 ports 2 "node40 HCA-1"
Ca : 0x0005ad00001ff968 ports 2 "node39 HCA-1"
Ca : 0x0005ad00001ffb6c ports 2 "node38 HCA-1"
Ca : 0x0005ad00001ffaa4 ports 2 "node37 HCA-1"
Ca : 0x0005ad00001ff970 ports 2 "node36 HCA-1"
Ca : 0x0005ad00001ffaa8 ports 2 "node35 HCA-1"
Ca : 0x0005ad00001ffaac ports 2 "node34 HCA-1"
Ca : 0x0005ad00001ffb20 ports 2 "node33 HCA-1"
Ca : 0x0005ad00001ffa34 ports 2 "node12 HCA-1"
Ca : 0x0005ad00001ffa68 ports 2 "node11 HCA-1"
Ca : 0x0005ad00001ff978 ports 2 "node10 HCA-1"
Ca : 0x0005ad00001ff974 ports 2 "node09 HCA-1"
Ca : 0x0005ad00001ffb34 ports 2 "node08 HCA-1"
Ca : 0x0005ad00001ffa48 ports 2 "node07 HCA-1"
Ca : 0x0005ad00001ff98c ports 2 "node06 HCA-1"
Ca : 0x0005ad00001ff994 ports 2 "node05 HCA-1"
Ca : 0x0005ad00001ff988 ports 2 "node04 HCA-1"
Ca : 0x0005ad00001ffa3c ports 2 "node03 HCA-1"
Ca : 0x0005ad00001ff9ac ports 2 "node02 HCA-1"
Ca : 0x0005ad00001ffaa0 ports 2 "node01 HCA-1"
Ca : 0x0005ad00001ff96c ports 2 "node24 HCA-1"
Ca : 0x0005ad00001ffb60 ports 2 "node23 HCA-1"
Ca : 0x0005ad00001ffa38 ports 2 "node22 HCA-1"
Ca : 0x0005ad00001ff9a8 ports 2 "node21 HCA-1"
Ca : 0x0005ad00001ffb4c ports 2 "node20 HCA-1"
Ca : 0x0005ad00001ffa40 ports 2 "node19 HCA-1"
Ca : 0x0005ad00001ff998 ports 2 "node18 HCA-1"
Ca : 0x0005ad00001ffbb8 ports 2 "node17 HCA-1"
Ca : 0x0005ad00001ffa9c ports 2 "node16 HCA-1"
Ca : 0x0005ad00001ffae8 ports 2 "node15 HCA-1"
Ca : 0x0005ad00001ffb3c ports 2 "node14 HCA-1"
Ca : 0x0005ad00001ff984 ports 2 "node13 HCA-1"
Ca : 0x0005ad00001ffbc0 ports 2 "node32 HCA-1"
Ca : 0x0005ad00001ff97c ports 2 "node31 HCA-1"
Ca : 0x0005ad00001ff980 ports 2 "node30 HCA-1"
Ca : 0x0005ad00001ffab0 ports 2 "node29 HCA-1"
Ca : 0x0005ad00001ffafc ports 2 "node28 HCA-1"
Ca : 0x0005ad00001ffb64 ports 2 "node27 HCA-1"
Ca : 0x0005ad00001ff9b8 ports 2 "node26 HCA-1"
Ca : 0x0005ad00001ffb50 ports 2 "node25 HCA-1"
Ca : 0x0005ad00001ffb28 ports 2 "aurora HCA-1"

[root@aurora ~]# ibdiagnet
Loading IBDIAGNET from: /usr/lib64/ibdiagnet1.2
-W- Topology file is not specified.
    Reports regarding cluster links will use direct routes.
Loading IBDM from: /usr/lib64/ibdm1.2
-I- Using port 1 as the local port.
-I- Discovering ... 77 nodes (12 Switches & 65 CA-s) discovered.

-I---------------------------------------------------
-I- Bad Guids/LIDs Info
-I---------------------------------------------------
-I- No bad Guids were found

-I---------------------------------------------------
-I- Links With Logical State = INIT
-I---------------------------------------------------
-I- No bad Links (with logical state = INIT) were found

-I---------------------------------------------------
-I- PM Counters Info
-I---------------------------------------------------
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=11
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=12
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0039 guid=0x0005ad00001ffb6d dev=25208 node38/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=19
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=20
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=21
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=23
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0015 guid=0x0005ad00001ffb3d dev=25208 node14/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x004c guid=0x0005ad00001ffab1 dev=25208 node29/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=15
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=16
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=19
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0005 guid=0x0005ad000604375e dev=47396 Port=23
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x001e guid=0x0005ad00001ffb51 dev=25208 node25/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0002 guid=0x0005ad00001ffaa1 dev=25208 node01/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x001b guid=0x0005ad00001ffaad dev=25208 node34/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0020 guid=0x0005ad00001ffa69 dev=25208 node11/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0019 guid=0x0005ad00001ff989 dev=25208 node04/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=19
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=20
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=21
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0004 guid=0x0005ad000604351a dev=47396 Port=23
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0016 guid=0x0005ad00001ffb35 dev=25208 node08/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x002c guid=0x0005ad00001ff995 dev=25208 node05/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0010 guid=0x0005ad00001ff98d dev=25208 node06/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x004a guid=0x0005ad00001ff9a5 dev=25208 node46/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0027 guid=0x0005ad00001ff9b9 dev=25208 node26/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=7
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=8
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=9
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x001f guid=0x0005ad00001ffa35 dev=25208 node12/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0009 guid=0x0005ad10060434a1 dev=47396 Port=19
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0009 guid=0x0005ad10060434a1 dev=47396 Port=20
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0045 guid=0x0005ad00001ffaf9 dev=25208 node43/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x003d guid=0x0005ad00001ffafd dev=25208 node28/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0008 guid=0x0005ad100604375e dev=47396 Port=13
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0008 guid=0x0005ad100604375e dev=47396 Port=20
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0014 guid=0x0005ad00001ff979 dev=25208 node10/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0035 guid=0x0005ad00001ffb31 dev=25208 node58/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x002a guid=0x0005ad00001ffa9d dev=25208 node16/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0028 guid=0x0005ad00001ffa39 dev=25208 node22/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=19
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=20
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=21
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0007 guid=0x0005ad100604351a dev=47396 Port=23
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0001 guid=0x0005ad00001ffb29 dev=25208 aurora/P1
    Performance Monitor counter : Value
    vl15_dropped : 0xffff (overflow)
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0043 guid=0x0005ad00001ffb81 dev=25208 node47/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)

-I---------------------------------------------------
-I- Fabric Partitions Report (see ibdiagnet.pkey for a full hosts list)
-I---------------------------------------------------
-I- PKey:0x7fff Hosts:65 full:65 partial:0

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x00000b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Suboptimal rate for group. Lowest member rate:20Gbps > group-rate:10Gbps

-I---------------------------------------------------
-I- Bad Links Info
-I- No bad link were found
-I---------------------------------------------------
----------------------------------------------------------------
-I- Stages Status Report:
    STAGE                        Errors  Warnings
    Bad GUIDs/LIDs Check         0       0
    Link State Active Check      0       0
    Performance Counters Report  0       47
    Partitions Check             0       0
    IPoIB Subnets Check          0       1

Please see /tmp/ibdiagnet.log for complete log
----------------------------------------------------------------

-I- Done. Run time was 5 seconds.

--
Prentice
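With 47 performance-counter warnings in the ibdiagnet report, it can be easier to tally them per device than to read them one by one. The following is a minimal sketch, assuming the ibdiagnet 1.2 text layout shown in this message (a `-W- lid=... guid=... dev=... <port label>` header followed by counter lines). The function name `overflowed_ports` and the `sample` text are hypothetical and not part of any IB tool; in practice the input would be the saved ibdiagnet output.

```python
import re
from collections import Counter

# One warning record: the "-W- lid=... guid=... dev=... <port label>" header,
# then everything up to the next "-W-", "-I-" or "-E-" marker (or end of text).
RECORD_RE = re.compile(
    r"-W-\s+lid=(0x[0-9a-fA-F]+)\s+guid=(0x[0-9a-fA-F]+)\s+dev=(\d+)\s+(\S+)"
    r"(.*?)(?=-[WIE]-|\Z)",
    re.S,
)

# Counter line that ibdiagnet prints when symbol_error_counter is saturated.
OVERFLOW_RE = re.compile(r"symbol_error_counter\s*:\s*0xffff")


def overflowed_ports(text):
    """Return (lid, guid, port_label) for each record with an overflowed symbol_error_counter."""
    hits = []
    for m in RECORD_RE.finditer(text):
        lid, guid, _dev, port, body = m.groups()
        if OVERFLOW_RE.search(body):
            hits.append((lid, guid, port))
    return hits


# Hypothetical excerpt in the same layout as the report in this message.
sample = """
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=11
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0039 guid=0x0005ad00001ffb6d dev=25208 node38/P1
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-W- lid=0x0006 guid=0x0005ad00060434a1 dev=47396 Port=12
    Performance Monitor counter : Value
    symbol_error_counter : 0xffff (overflow)
-I- Fabric Partitions Report
"""

# Count affected ports per LID, so devices with many saturated ports stand out.
per_lid = Counter(lid for lid, _guid, _port in overflowed_ports(sample))
for lid, count in per_lid.most_common():
    print(lid, count)
```

The pattern matches on whitespace rather than line breaks, so it should also cope with output that has been flattened onto a single line. A saturated counter only says that errors accumulated since the counters were last cleared, not when they occurred.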
