[Beowulf] Poor bandwidth from one compute node

Gus Correa gus at ldeo.columbia.edu
Thu Aug 17 11:40:22 PDT 2017


On 08/17/2017 12:35 PM, Joe Landman wrote:
> 
> 
> On 08/17/2017 12:00 PM, Faraz Hussain wrote:
>> I noticed an MPI job was taking 5X longer to run whenever it got the 
>> compute node lusytp104. So I ran qperf and found the bandwidth 
>> between it and any other nodes was ~100MB/sec. This is much lower than 
>> ~1GB/sec between all the other nodes. Any tips on how to debug 
>> further? I haven't tried rebooting since it is currently running a 
>> single-node job.
>>
>> [hussaif1@lusytp114 ~]$ qperf lusytp104 tcp_lat tcp_bw
>> tcp_lat:
>>     latency  =  17.4 us
>> tcp_bw:
>>     bw  =  118 MB/sec
>> [hussaif1@lusytp114 ~]$ qperf lusytp113 tcp_lat tcp_bw
>> tcp_lat:
>>     latency  =  20.4 us
>> tcp_bw:
>>     bw  =  1.07 GB/sec
>>
>> This is a separate issue from my previous post about a slow compute 
>> node. I am still investigating that per the helpful replies. Will post 
>> an update about that once I find the root cause!
> 
> Sounds very much like it is running over gigabit ethernet vs 
> Infiniband.  Check to make sure it is using the right network ...

Hi Faraz

As others have said in reply to your previous posting about InfiniBand:

- Check whether the node is configured the same way as the other nodes:
in the case of InfiniBand, whether the MTU is the same,
whether it is using connected or datagram mode, etc.
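
For instance (assuming the IPoIB interface is called ib0 on your
nodes; adjust the name to match your setup), you could compare the
suspect node with a healthy one:

   ip link show ib0                 # link state and IPoIB MTU
   cat /sys/class/net/ib0/mode      # connected vs. datagram mode
   ibstat                           # HCA port state, rate, physical state

A port running at a degraded rate, or an MTU/mode mismatch, would
show up there.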

**

Also, with Open MPI you can force it at runtime not to use TCP:
--mca btl ^tcp
or with the syntax in this FAQ:
https://www.open-mpi.org/faq/?category=openfabrics#ib-btl

If that node has an InfiniBand interface with a problem,
this should at least give a clue.
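
For example (a sketch: the executable name and process count are
placeholders, and the exact BTL names depend on your Open MPI
version):

   # disable the TCP BTL, so the job has to use the IB/OpenFabrics path
   mpirun --mca btl ^tcp -np 16 ./my_mpi_app

   # or list the allowed BTLs explicitly, as in the FAQ
   mpirun --mca btl openib,self,sm -np 16 ./my_mpi_app

If the openib BTL cannot be brought up on lusytp104, the job should
then fail with an error instead of silently falling back to gigabit
Ethernet.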

**

In addition, check the limits on the node.
They may be set by your resource manager,
in /etc/security/limits.conf,
or perhaps in the actual job script.
The memlock limit is key for Open MPI over InfiniBand.
See FAQ 15, 16, 17 here:
https://www.open-mpi.org/faq/?category=openfabrics
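
A quick way to compare (the limits.conf lines below are only an
illustration of the common 'unlimited' setting, not a prescription
for your site):

   # as the user who runs the MPI jobs, on a good node and on lusytp104:
   ulimit -l    # max locked memory; usually 'unlimited' on IB clusters

   # example /etc/security/limits.conf entries often used with IB:
   #   *   soft   memlock   unlimited
   #   *   hard   memlock   unlimited

Keep in mind that batch jobs may inherit their limits from the
resource manager daemon rather than from limits.conf.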

**

Moreover, check whether mlx4_core.conf (assuming it is Mellanox HW)
is configured the same way across the nodes:

/etc/modprobe.d/mlx4_core.conf

See FAQ 18 here:
https://www.open-mpi.org/faq/?category=openfabrics
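
As an illustration (the parameter values here are placeholders; FAQ
18 explains how to size them for the memory in your nodes), the file
typically holds a single options line such as:

   # /etc/modprobe.d/mlx4_core.conf
   # log_num_mtt and log_mtts_per_seg control how much memory the
   # HCA can register for RDMA
   options mlx4_core log_num_mtt=24 log_mtts_per_seg=1

A node whose registered-memory limit differs from its peers can
behave very differently under MPI traffic.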

**

To increase the BTL diagnostic verbosity (which goes to stderr, IIRC):

--mca btl_base_verbose 30

That may point out which interfaces are actually being used, etc.

See this FAQ:

https://www.open-mpi.org/faq/?category=all#diagnose-multi-host-problems
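
A sketch of how to combine it with the earlier suggestion (again,
the executable name and process count are placeholders):

   mpirun --mca btl ^tcp --mca btl_base_verbose 30 \
          -np 16 ./my_mpi_app 2> btl_debug.log

Comparing the messages produced on lusytp104 with those from a
healthy node should show whether the openib BTL is selected there
at all.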

**

Finally, as John has suggested before, you may want to
subscribe to the Open MPI mailing list,
and ask the question there as well:

https://www.open-mpi.org/community/help/
https://www.open-mpi.org/community/lists/

There you will get feedback from the Open MPI developers +
user community, and that often includes insights from
Intel and Mellanox IB hardware experts.

**

I hope this helps.

Gus Correa

> 
>>
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit 
>> http://www.beowulf.org/mailman/listinfo/beowulf
> 


