[Beowulf] IB problem with openmpi 1.2.8

Prentice Bisbal prentice at ias.edu
Tue Jul 13 13:50:06 PDT 2010


Bill,

Have you checked the health of the cables themselves? It could just be
dumb luck that a hardware failure coincided with the software change and
didn't manifest itself until the nodes were rebooted. Did you reboot the
switches, too?
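
A quick way to check is to look at the per-port error counters. A rough
sketch (the LID/port numbers are made up; substitute the suspect links
from your own topology):

    # Dump the topology so you know which LID/port pairs to look at:
    ibnetdiscover > topo.txt

    # Query the error counters on a suspect switch port (LID 12, port 7
    # are placeholders); climbing symbol errors, link-downed counts, or
    # receive errors usually point at a bad cable or connector:
    perfquery 12 7

    # Or sweep the whole fabric for ports with non-zero error counters:
    ibcheckerrors -v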

I would try dividing your cluster into small sections and see if the
problem exists across the sections.
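
Something along these lines, with made-up hostfile names, should narrow
it down:

    # Run IMB over one rack / leaf switch at a time, using separate
    # hostfiles (hosts.rack1, hosts.rack2 are placeholders):
    mpirun -np 32 -hostfile hosts.rack1 ./IMB-MPI1
    mpirun -np 32 -hostfile hosts.rack2 ./IMB-MPI1

    # If each subset passes on its own but a mixed hostfile fails, that
    # points at the switch-to-switch links rather than the nodes.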

Can you disconnect the edge switches from the core switch, so that each
edge switch is its own, isolated fabric? If so, you could then start an
SM on each fabric and see if the problem is on every smaller IB fabric,
or just one.
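
A rough sketch of that, run on one node per isolated fabric (the GUID
below is a placeholder; take the real one from ibstat):

    # Find the port GUID of the local HCA port:
    ibstat

    # Start an SM bound to that port, daemonized:
    opensm -g 0x0002c90200001234 -B

    # Confirm that exactly one MASTER SM is visible on this fabric:
    sminfo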

The other option would be to disconnect all the nodes and add them back
one by one, but that wouldn't catch a problem with a switch-to-switch
connection.

How big is the cluster? Would it take hours or days to test each node
like this?

You say the problem occurs when the node count goes over 32 (or 40). Do
you mean 32 physical nodes, or 32 processors? How does your scheduler
assign nodes? Would those 32 nodes always be in the same rack or on the
same IB switch, but not when the count increases?
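
If you have a pair of hosts from one of the failing runs, it might be
worth checking whether their traffic crosses the core switch. A rough
sketch (the LIDs are placeholders):

    # Find the LIDs of the two hosts in the topology dump:
    ibnetdiscover | grep -i woodhen-050
    ibnetdiscover | grep -i woodhen-036

    # Trace the route between them; each switch hop is printed, so you
    # can see whether the path stays on one edge switch or goes through
    # the core:
    ibtracert 112 87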

Prentice



Bill Wichser wrote:
> Just some more info.  Went back to the prior kernel with no luck. 
> Updated the firmware on the Topspin HBA cards to the latest (final)
> version (fw-25208-4_8_200-MHEL-CF128-T).    Nothing changes.   Still not
> sure where to look.
> 
> Bill Wichser wrote:
>> Machine is an older Intel Woodcrest cluster with a two tiered IB
>> infrastructure with Topspin/Cisco 7000 switches.  The core switch is a
>> SFS-7008P with a single management module which runs the SM manager. 
>> The cluster runs RHEL4 and was upgraded last week to kernel
>> 2.6.9-89.0.26.ELsmp.  The openib-1.4 remained the same.  Pretty much
>> stock.
>>
>> After rebooting, the IB cards in the nodes remained in the INIT
>> state.  I rebooted the chassis IB switch as it appeared that no SM was
>> running.  No help.  I manually started an opensm on a compute node
>> telling it to ignore other masters as initially it would only come up
>> in STANDBY.  This turned all the nodes' IB ports to active and I
>> thought that I was done.
>>
>> ibdiagnet complained that there were two masters.  So I killed the
>> opensm and now it was happy.  osmtest -f c/osmtest -f a  comes back
>> with OSMTEST: TEST "All Validations" PASS.
>> ibdiagnet -ls 2.5 -lw 4x   finds all my switches and nodes with
>> everything coming up roses.
>>
>> The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the
>> node count goes over 32 (or maybe 40).  This worked fine in the past,
>> before the reboot.  User apps are failing as well as IMB v3.2.  I've
>> increased the timeout using the "mpiexec -mca btl_openib_ib_timeout
>> 20" which helped for 48 nodes but when increasing to 64 and 128 it
>> didn't help at all.  Typical error messages follow.
>>
>> Right now I am stuck.  I'm not sure what or where the problem might
>> be.  Nor where to go next.  If anyone has a clue, I'd appreciate
>> hearing it!
>>
>> Thanks,
>> Bill
>>
>>
>> typical error messages
>>
>> [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress]
>> from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY
>> EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
>> [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress]
>> from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY
>> EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
>> [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress]
>> from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY
>> EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0
>> --------------------------------------------------------------------------
>>
>> The InfiniBand retry count between two MPI processes has been
>> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
>> (section 12.7.38):
>>
>>    The total number of times that the sender wishes the receiver to
>>    retry timeout, packet sequence, etc. errors before posting a
>>    completion error.
>>
>> This error typically means that there is something awry within the
>> InfiniBand fabric itself.  You should note the hosts on which this
>> error has occurred; it has been observed that rebooting or removing a
>> particular host from the job can sometimes resolve this issue.
>>
>> Two MCA parameters can be used to control Open MPI's behavior with
>> respect to the retry count:
>>
>> * btl_openib_ib_retry_count - The number of times the sender will
>>  attempt to retry (defaulted to 7, the maximum value).
>>
>> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>>  to 10).  The actual timeout value used is calculated as:
>>
>>     4.096 microseconds * (2^btl_openib_ib_timeout)
>>
>>  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>> --------------------------------------------------------------------------
>>
>> --------------------------------------------------------------------------
>>
>>
>> DIFFERENT RUN:
>>
>> [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress]
>> from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY
>> EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0
>> ...

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ


