[Beowulf] Good IB network performance when using 7 cores, poor performance on all 8?

Joe Landman landman at scalableinformatics.com
Thu Apr 24 08:56:32 PDT 2014


On 04/24/2014 11:31 AM, Brian Dobbins wrote:
>
> Hi everyone,
>
>    We're having a problem with one of our clusters after it was upgraded
> to RH6.2 (from CentOS5.5) - the performance of our Infiniband network
> degrades randomly and severely when using all 8 cores in our nodes for
> MPI,... but not when using only 7 cores per node.
>
>    For example, I have a hacked-together script (below) that does a
> sequence of 20 sets of fifty MPI_Allreduce tests via the Intel MPI
> benchmarks, and then calculates statistics on the average times per
> individual set.  For our 'good' (CentOS 5.5) nodes, we see consistent
> results:
>
> % perftest hosts_c20_8c.txt
>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>    176.0   177.3   182.6   182.8   186.1   196.9
> % perftest hosts_c20_8c.txt
>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>    176.3   180.4   184.8   187.0   189.1   213.5
>
>    ... But for our tests on the RH6.2 install, we see enormous variance:
>
> % perftest hosts_c18_8c.txt
>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>    176.8   185.9   217.0   347.6   387.7  1242.0
> % perftest hosts_c18_8c.txt
>     Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>    178.2   204.5   390.5   329.6   409.4   493.1
>
>    Note that the minimums are similar -- not /every/ run experiences
> this jitter - and in the case of the first run of the script, even the
> median value is pretty decent, so seemingly only a few of the tests were
> high.  But the maximum is enormous.  Each of these tests is run one
> right after the other, and strangely it seems to always differ between
> /instances/ of the IMB code, not in individual loops - e.g., one of the
> fifty runs inside an individual call.  Those all seem consistent, so
> that's either luck, or some issue on mapping the IB device, or some
> interrupt issue in the kernel, etc.

The median changes by more than a factor of 2 in the second RH6.2 run, 
and the distribution tail is *huge*.
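
To put rough numbers on that, here is a minimal Python sketch that simply divides the summary values quoted above (the first c20 run versus each c18 run). The dictionary layout and the names (`good_c20`, `bad_c18_run1`, `ratios`, ...) are my own for illustration and are not from the perftest script.

```python
# Summary rows copied from the perftest output quoted above.
# Column order: Min, 1st Qu., Median, Mean, 3rd Qu., Max.
LABELS = ("min", "q1", "median", "mean", "q3", "max")

good_c20 = dict(zip(LABELS, (176.0, 177.3, 182.6, 182.8, 186.1, 196.9)))
bad_c18_run1 = dict(zip(LABELS, (176.8, 185.9, 217.0, 347.6, 387.7, 1242.0)))
bad_c18_run2 = dict(zip(LABELS, (178.2, 204.5, 390.5, 329.6, 409.4, 493.1)))


def ratios(bad, good):
    """Return bad/good for each summary statistic."""
    return {k: bad[k] / good[k] for k in LABELS}


if __name__ == "__main__":
    for name, bad in (("c18 run 1", bad_c18_run1), ("c18 run 2", bad_c18_run2)):
        r = ratios(bad, good_c20)
        print(name, " ".join(f"{k}={v:.2f}x" for k, v in r.items()))
```

The minimums come out essentially at 1x, which fits the observation that not every run is hit. The damage shows up in the median of the second run and in the maximum of the first.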

FWIW: 6.2 was a terrible release.  If you have to use pure RHEL, get to 
6.5+.  And there are many tunables you need to look at.

Bigger view ... have you isolated a CPU for IB handling?  If so, at 7 
cores your machine is full (1 for IB and 7 for apps), but at 8 cores you 
are contending for resources (8 for apps + 1 for IB).

Are you running the app with taskset (explicitly or implicitly)?
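
One quick way to check is to have a process report the CPUs it is actually allowed to run on. Below is a rough Python sketch, Linux-only. The helper names (`parse_cpu_list`, `contended`) are made up for illustration, and "core 7 reserved for IB" is purely a hypothetical layout, not something I know about your nodes.

```python
import os


def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0-3,7' into a set of ints.

    This is the format used by `taskset -c` and by the
    Cpus_allowed_list field in /proc/<pid>/status.
    """
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def contended(app_cpus, reserved_cpus):
    """Return the CPUs the app may run on that were meant to be reserved."""
    return set(app_cpus) & set(reserved_cpus)


if __name__ == "__main__":
    # os.sched_getaffinity is Linux-only; pid 0 means the calling process.
    mine = os.sched_getaffinity(0)
    print("allowed CPUs:", sorted(mine))
    # Hypothetical layout: core 7 set aside for IB handling.
    print("overlap with reserved core:", sorted(contended(mine, {7})))
```

Launching it under something like `taskset -c 0-6 python3 check_affinity.py` (the script name is arbitrary) should show an empty overlap, while an unpinned launch on an 8-core node would list core 7. The same check could be run from inside each MPI rank to see what the launcher actually did.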




-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
twtr : @scalableinfo
phone: +1 734 786 8423 x121
cell : +1 734 612 4615


