[Beowulf] IB problem with openmpi 1.2.8

Tue Jul 13 14:50:35 PDT 2010

On 7/13/2010 4:50 PM, Prentice Bisbal wrote:
> Bill,
>
> Have you checked the health of the cables themselves? It could just be
> dumb luck that a hardware failure coincided with a software change and
> didn't manifest itself until the reboot of the nodes. Did you reboot the
> switches, too?
>    
Just looked at all the lights and they all seem fine.

> I would try dividing your cluster into small sections and see if the
> problem exists across the sections.
>
> Can you disconnect the edge switches from the core switch, so that each
> edge switch is its own, isolated fabric? If so, you could then start an
> SM on each fabric and see if the problem is on every smaller IB fabric,
> or just one.
>    
I've thought about this one.  Non-trivial.  I have a core switch 
connecting 12 leaf switches.  Each switch connects to 16 nodes.  I need 
to use that core switch in order to make the problem appear.
> The other option would be to disconnect all the nodes and add them back
> one by one, but that wouldn't catch a problem with a switch-to-switch
> connection.
>
> How big is the cluster? Would it take hours or days to test each node
> like this?
>
>    
192 nodes (8 cores each).
> You say the problem occurs when the node count goes over 32 (or 40). Do
> you mean 32 physical nodes, or 32 processors? How does your scheduler
> assign nodes? Would those 32 nodes always be in the same rack or on the
> same IB switch, but not when the count increases?
>    
It starts failing at 48 nodes.  PBS allocates nodes least-loaded first, 
in round-robin fashion, which in practice works out to sequentially.  The 
exception is the PVFS nodes, which are distributed throughout the cluster 
and allocated last, also round robin.  Even the 32-node jobs definitely 
go through the core, and it never seems to matter where they land.  I've 
tried to pinpoint some nodes by keeping lists, but this happens 
everywhere.  I was hoping that some tool I'm not aware of exists, but 
apparently not.  My next attempt may be to pull the management card from 
the core and just run opensm on the nodes themselves, like we do for 
other clusters.  But I can test with osmtest all day and never get 
errors.  This makes me feel very uncomfortable!
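
The list-keeping could be scripted.  What follows is only a minimal 
sketch, not anything from the cluster's actual tooling.  It assumes the 
job output has been captured as text and that the error lines look like 
the btl_openib messages quoted further down.  The names PAIR_RE and 
tally_hosts are made up for illustration.

```python
import re
from collections import Counter

# Matches the "from <host> to: <host> error polling ... RETRY EXCEEDED ERROR"
# part of the btl_openib messages, after whitespace has been collapsed.
PAIR_RE = re.compile(
    r"from (\S+) to: (\S+) error polling \S+ CQ with status RETRY EXCEEDED ERROR"
)

def tally_hosts(log_text):
    """Count how often each host appears as sender or receiver in a retry error."""
    # Messages are often wrapped across lines, so join everything with single spaces.
    flat = " ".join(log_text.split())
    counts = Counter()
    for src, dst in PAIR_RE.findall(flat):
        counts[src] += 1
        counts[dst] += 1
    return counts

# Two of the messages from this thread, used as sample input.
sample = """
[0,1,33][btl_openib_component.c:1371:btl_openib_component_progress]
from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
[0,1,36][btl_openib_component.c:1371:btl_openib_component_progress]
from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY
EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
"""

for host, n in tally_hosts(sample).most_common():
    print(host, n)
```

Fed the output of many failed runs, a host that shows up far more often 
than the rest might point at a cable or HCA, while an even spread would 
be consistent with what I'm seeing by hand.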

Of course, nothing is under warranty anymore.  Divide and conquer seems 
like the only solution.

Thanks,
Bill
> Prentice
>
>
>
> Bill Wichser wrote:
>    
>> Just some more info.  Went back to the prior kernel with no luck.
>> Updated the firmware on the Topspin HBA cards to the latest (final)
>> version (fw-25208-4_8_200-MHEL-CF128-T).    Nothing changes.   Still not
>> sure where to look.
>>
>> Bill Wichser wrote:
>>      
>>> Machine is an older Intel Woodcrest cluster with a two-tiered IB
>>> infrastructure with Topspin/Cisco 7000 switches.  The core switch is an
>>> SFS-7008P with a single management module which runs the subnet manager (SM).
>>> The cluster runs RHEL4 and was upgraded last week to kernel
>>> 2.6.9-89.0.26.ELsmp.  The openib-1.4 remained the same.  Pretty much
>>> stock.
>>>
>>> After rebooting, the IB cards in the nodes remained in the INIT
>>> state.  I rebooted the chassis IB switch as it appeared that no SM was
>>> running.  No help.  I manually started an opensm on a compute node
>>> telling it to ignore other masters as initially it would only come up
>>> in STANDBY.  This turned all the nodes' IB ports to active and I
>>> thought that I was done.
>>>
>>> ibdiagnet complained that there were two masters.  So I killed the
>>> opensm and now it was happy.  osmtest -f c/osmtest -f a  comes back
>>> with OSMTEST: TEST "All Validations" PASS.
>>> ibdiagnet -ls 2.5 -lw 4x   finds all my switches and nodes with
>>> everything coming up roses.
>>>
>>> The problem is that openmpi 1.2.8 with Intel 11.1.074 fails when the
>>> node count goes over 32 (or maybe 40).  This worked fine in the past,
>>> before the reboot.  User apps are failing as well as IMB v3.2.  I've
>>> increased the timeout using the "mpiexec -mca btl_openib_ib_timeout
>>> 20" which helped for 48 nodes but when increasing to 64 and 128 it
>>> didn't help at all.  Typical error messages follow.
>>>
>>> Right now I am stuck.  I'm not sure what or where the problem might
>>> be.  Nor where to go next.  If anyone has a clue, I'd appreciate
>>> hearing it!
>>>
>>> Thanks,
>>> Bill
>>>
>>>
>>> typical error messages
>>>
>>> [0,1,33][btl_openib_component.c:1371:btl_openib_component_progress]
>>> from woodhen-050 to: woodhen-036 error polling HP CQ with status RETRY
>>> EXCEEDED ERROR status number 12 for wr_id 182937592248 opcode 0
>>> [0,1,36][btl_openib_component.c:1371:btl_openib_component_progress]
>>> from woodhen-084 to: woodhen-085 error polling HP CQ with status RETRY
>>> EXCEEDED ERROR status number 12 for wr_id 5840952 opcode 0
>>> [0,1,40][btl_openib_component.c:1371:btl_openib_component_progress]
>>> from woodhen-098 to: woodhen-096 error polling LP CQ with status RETRY
>>> EXCEEDED ERROR status number 12 for wr_id 182947573944 opcode 0
>>> --------------------------------------------------------------------------
>>>
>>> The InfiniBand retry count between two MPI processes has been
>>> exceeded.  "Retry count" is defined in the InfiniBand spec 1.2
>>> (section 12.7.38):
>>>
>>>     The total number of times that the sender wishes the receiver to
>>>     retry timeout, packet sequence, etc. errors before posting a
>>>     completion error.
>>>
>>> This error typically means that there is something awry within the
>>> InfiniBand fabric itself.  You should note the hosts on which this
>>> error has occurred; it has been observed that rebooting or removing a
>>> particular host from the job can sometimes resolve this issue.
>>>
>>> Two MCA parameters can be used to control Open MPI's behavior with
>>> respect to the retry count:
>>>
>>> * btl_openib_ib_retry_count - The number of times the sender will
>>>   attempt to retry (defaulted to 7, the maximum value).
>>>
>>> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>>>   to 10).  The actual timeout value used is calculated as:
>>>
>>>      4.096 microseconds * (2^btl_openib_ib_timeout)
>>>
>>>   See the InfiniBand spec 1.2 (section 12.7.34) for more details.
>>> --------------------------------------------------------------------------
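
To put numbers on the formula in the help text above, here is a small 
sketch that just evaluates 4.096 microseconds * 2^btl_openib_ib_timeout 
for the default of 10 and for the value of 20 tried earlier.  The 
function name is only an illustration and is not part of Open MPI.

```python
def ack_timeout_seconds(ib_timeout):
    """Local ACK timeout per the quoted formula: 4.096 us * 2**ib_timeout."""
    return 4.096e-6 * (2 ** ib_timeout)

# Compare the Open MPI default (10) with the value passed via -mca (20).
for value in (10, 20):
    seconds = ack_timeout_seconds(value)
    print(f"btl_openib_ib_timeout={value}: {seconds:.6f} s")
```

That works out to roughly 4.2 ms for the default and roughly 4.3 s for a 
setting of 20, so each step of 1 in the parameter doubles the wait.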
>>>
>>> --------------------------------------------------------------------------
>>>
>>>
>>> DIFFERENT RUN:
>>>
>>> [0,1,92][btl_openib_component.c:1371:btl_openib_component_progress]
>>> from woodhen-157 to: woodhen-081 error polling HP CQ with status RETRY
>>> EXCEEDED ERROR status number 12 for wr_id 183541169080 opcode 0
>>> ...
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>>> To change your subscription (digest mode or unsubscribe) visit
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>>        