[Beowulf] IB problem with openmpi 1.2.8

Bill Wichser bill at Princeton.EDU
Tue Jul 13 13:09:20 PDT 2010


Just some more info.  Went back to the prior kernel with no luck.  
Updated the firmware on the Topspin HCA cards to the latest (final) 
version (fw-25208-4_8_200-MHEL-CF128-T).  Nothing changed.  Still not 
sure where to look.

Bill Wichser wrote:
> Machine is an older Intel Woodcrest cluster with a two-tiered IB 
> infrastructure with Topspin/Cisco 7000 switches.  The core switch is a 
> SFS-7008P with a single management module which runs the subnet manager 
> (SM).  The cluster runs RHEL4 and was upgraded last week to kernel 
> 2.6.9-89.0.26.ELsmp.  The openib-1.4 remained the same.  Pretty much 
> stock.
>
> After rebooting, the IB cards in the nodes remained in the INIT 
> state.  I rebooted the chassis IB switch as it appeared that no SM was 
> running.  No help.  I manually started an opensm on a compute node 
> telling it to ignore other masters as initially it would only come up 
> in STANDBY.  This turned all the nodes' IB ports to active and I 
> thought that I was done.
>
> ibdiagnet complained that there were two masters.  So I killed the 
> opensm and now it was happy.  Running "osmtest -f c" and then 
> "osmtest -f a" comes back with OSMTEST: TEST "All Validations" PASS.
> "ibdiagnet -ls 2.5 -lw 4x" finds all my switches and nodes with 
> everything coming up roses.
>
> The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the 
> node count goes over 32 (or maybe 40).  This worked fine in the past, 
> before the reboot.  User apps are failing as well as IMB v3.2.  I've 
> increased the timeout using the "mpiexec -mca btl_openib_ib_timeout 
> 20" which helped for 48 nodes but when increasing to 64 and 128 it 
> didn't help at all.  Typical error messages follow.
>
> Right now I am stuck.  I'm not sure what or where the problem might 
> be.  Nor where to go next.  If anyone has a clue, I'd appreciate 
> hearing it!
>
> Thanks,
> Bill
>
>
> typical error messages
>
> [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] 
> from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY 
> EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
> [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] 
> from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY 
> EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
> [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] 
> from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY 
> EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0
> -------------------------------------------------------------------------- 
>
> The InfiniBand retry count between two MPI processes has been
> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
> (section 12.7.38):
>
>    The total number of times that the sender wishes the receiver to
>    retry timeout, packet sequence, etc. errors before posting a
>    completion error.
>
> This error typically means that there is something awry within the
> InfiniBand fabric itself.  You should note the hosts on which this
> error has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
>
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
>  attempt to retry (defaulted to 7, the maximum value).
>
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>  to 10).  The actual timeout value used is calculated as:
>
>     4.096 microseconds * (2^btl_openib_ib_timeout)
>
>  See the InfiniBand spec 1.2 (section 12.7.34) for more details.
> -------------------------------------------------------------------------- 
>
> -------------------------------------------------------------------------- 
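
For reference, the timeout formula quoted in the help text above can be
worked out directly.  This is only a minimal Python sketch (the function
name ack_timeout_seconds is made up for illustration and is not part of
Open MPI); it evaluates 4.096 microseconds * 2^btl_openib_ib_timeout for
the default of 10 and for the value of 20 that was tried:

```python
# Local ACK timeout, following the formula printed in the Open MPI help text:
#   4.096 microseconds * (2 ^ btl_openib_ib_timeout)
BASE_MICROSECONDS = 4.096


def ack_timeout_seconds(exponent: int) -> float:
    """Return the local ACK timeout in seconds for a given exponent.

    Hypothetical helper for illustration only; not an Open MPI API.
    """
    return BASE_MICROSECONDS * (2 ** exponent) / 1e6


# 10 is the default named in the help text; 20 is the value tried with mpiexec.
for exponent in (10, 20):
    print(f"btl_openib_ib_timeout={exponent}: "
          f"{ack_timeout_seconds(exponent):.6f} s")
```

By that formula the default of 10 comes to roughly 4.2 milliseconds and a
value of 20 to roughly 4.3 seconds.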
>
>
> DIFFERENT RUN:
>
> [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] 
> from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY 
> EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0
> ...
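
The help text suggests noting the hosts on which the error occurs.  One
rough way to do that, assuming the messages keep the
"from <host> to: <host> error polling" form shown above, is to count how
often each host appears on either end.  The names PAIR_RE, tally_hosts
and SAMPLE are hypothetical, and the sample is just the four messages
quoted in this post, each joined onto one line:

```python
import re
from collections import Counter

# Matches the "from <host> to: <host> error polling" fragment of each message.
# \s+ tolerates wrapped lines or extra spaces between the words.
PAIR_RE = re.compile(r"from\s+(\S+)\s+to:\s+(\S+)\s+error polling")


def tally_hosts(log_text: str) -> Counter:
    """Count how many error messages mention each host (as sender or receiver)."""
    counts = Counter()
    for src, dst in PAIR_RE.findall(log_text):
        counts[src] += 1
        counts[dst] += 1
    return counts


# The four messages quoted in this post, unwrapped to one line each.
SAMPLE = """
[0,1,33][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
[0,1,36][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
[0,1,40][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0
[0,1,92][btl_openib_component.c:1371:btl_openib_component_progress] from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0
"""

# Print hosts sorted by how often they show up in the errors.
for host, count in tally_hosts(SAMPLE).most_common():
    print(f"{host}: {count}")
```

Run over the full output of several failing jobs, a host or pair that
shows up far more often than the rest would be a place to start looking.
With only the four messages above, every host appears once, so nothing
stands out yet.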


