[Beowulf] Good IB network performance when using 7 cores, poor performance on all 8?

Thu Apr 24 09:17:33 PDT 2014

I am with Joe regarding looking at the interrupts.
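
As a rough way to look at the interrupts, the sketch below sums the per-CPU counts in /proc/interrupts for rows whose label contains a given string. The function name and the "mlx" pattern are assumptions for illustration (Mellanox drivers usually label their IRQs that way; other HCAs will differ), so check the labels on your own nodes first.

```python
from pathlib import Path

def irq_counts_by_cpu(text, pattern):
    """Sum per-CPU interrupt counts over /proc/interrupts rows containing `pattern`."""
    lines = text.splitlines()
    if not lines:
        return []
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        if pattern not in line:
            continue
        fields = line.split()
        counts = fields[1:1 + ncpu]       # fields[0] is the IRQ label, e.g. "64:"
        if len(counts) < ncpu or not all(c.isdigit() for c in counts):
            continue                      # skip short summary rows such as ERR:
        for cpu, c in enumerate(counts):
            totals[cpu] += int(c)
    return totals

# Hypothetical usage on a node ("mlx" is an assumed label, adjust as needed):
#   print(irq_counts_by_cpu(Path("/proc/interrupts").read_text(), "mlx"))
```

If nearly all of the matching interrupts land on one core, and that core is also running an MPI rank in the 8-core case, that would fit the contention Joe describes.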

However, could this be a difference in power management in the
Red Hat kernel?
i.e., when running on 8 cores you are tripping over some thermal threshold
and causing a throttle back to a lower P-state (reduced clock frequency)?
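
One hedged way to test the throttling idea is to snapshot the kernel's thermal throttle counters before and after a benchmark run. The sketch assumes an Intel x86 Linux kernel that exposes thermal_throttle/core_throttle_count under /sys/devices/system/cpu; I have not checked that the RH6.2 kernel does. Both function names are made up for illustration.

```python
from pathlib import Path

def read_throttle_counts(root="/sys/devices/system/cpu"):
    """Return {cpu_name: core_throttle_count} for every CPU exposing the counter."""
    counts = {}
    pattern = "cpu[0-9]*/thermal_throttle/core_throttle_count"
    for f in sorted(Path(root).glob(pattern)):
        # f.parent is .../cpuN/thermal_throttle, so f.parent.parent.name is "cpuN"
        counts[f.parent.parent.name] = int(f.read_text().strip())
    return counts

def throttle_delta(before, after):
    """CPUs whose throttle counter increased between two snapshots."""
    return {cpu: after[cpu] - before.get(cpu, 0)
            for cpu in after
            if after[cpu] > before.get(cpu, 0)}

# Hypothetical usage around a benchmark run:
#   before = read_throttle_counts()
#   ... run the 8-core test ...
#   print(throttle_delta(before, read_throttle_counts()))
```

An empty result after a slow run would point away from thermal throttling and back towards interrupts or affinity.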

Can you give the kernel versions for both setups?

On 24 April 2014 16:56, Joe Landman <landman at scalableinformatics.com> wrote:

> On 04/24/2014 11:31 AM, Brian Dobbins wrote:
>
>>
>> Hi everyone,
>>
>>    We're having a problem with one of our clusters after it was upgraded
>> to RH6.2 (from CentOS5.5) - the performance of our Infiniband network
>> degrades randomly and severely when using all 8 cores in our nodes for
>> MPI,... but not when using only 7 cores per node.
>>
>>    For example, I have a hacked-together script (below) that does a
>> sequence of 20 sets of fifty MPI_Allreduce tests via the Intel MPI
>> benchmarks, and then calculates statistics on the average times per
>> individual set.  For our 'good' (CentOS 5.5) nodes, we see consistent
>> results:
>>
>> % perftest hosts_c20_8c.txt
>>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>    176.0   177.3   182.6   182.8   186.1   196.9
>> % perftest hosts_c20_8c.txt
>>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>    176.3   180.4   184.8   187.0   189.1   213.5
>>
>>    ... But for our tests on the RH6.2 install, we see enormous variance:
>>
>> % perftest hosts_c18_8c.txt
>>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>    176.8   185.9   217.0   347.6   387.7  1242.0
>> % perftest hosts_c18_8c.txt
>>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>    178.2   204.5   390.5   329.6   409.4   493.1
>>
>>    Note that the minimums are similar -- not /every/ run experiences
>> this jitter -- and in the case of the first run of the script, even the
>> median value is pretty decent, so seemingly only a few of the tests were
>> high.  But the maximum is enormous.  Each of these tests is run one
>> right after the other, and strangely it seems to always differ between
>> /instances/ of the IMB code, not in individual loops - e.g., one of the
>> fifty runs inside an individual call.  Those all seem consistent, so
>> that's either luck, or some issue on mapping the IB device, or some
>> interrupt issue in the kernel, etc.
>>
>
> Median changes by more than a factor of 2. And the distribution tail is
> *huge*.
>
> FWIW: 6.2 was a terrible release.  If you have to use pure RHEL, get to
> 6.5+.  And there are many tunables you need to look at.
>
> Bigger view ... have you isolated a CPU for IB handling, so at 7 cores,
> your machine is full (1 for IB and 7 for apps), but at 8 cores you are
> contending for resources (8 for apps + 1 for IB)?
>
> Are you running the app with taskset (explicitly or implicitly)?
>
>
>
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics, Inc.
> email: landman at scalableinformatics.com
> web  : http://scalableinformatics.com
> twtr : @scalableinfo
> phone: +1 734 786 8423 x121
> cell : +1 734 612 4615