[Beowulf] Good IB network performance when using 7 cores, poor performance on all 8?
bdobbins at gmail.com
Thu Apr 24 10:31:30 PDT 2014
Median changes by more than factor of 2. And the distribution tail is
> FWIW: 6.2 was a terrible release. If you have to use pure RHEL, get to
> 6.5+. And there are many tunables you need to look at.
Thanks for your reply - I may look into asking our IT squad to put 6.5 on
a set of nodes for testing, but playing with the tunables is probably the
first step. I don't have root access and can't switch things up, but a few
of the power options (e.g., /sys/module/pcie_aspm/parameters/policy) are
already looking like decent things to switch around, as that's in a 'power
save' state currently on the poorly performing nodes, whereas it doesn't
even exist on the 5.5 nodes.
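
For comparing nodes without root, a minimal sketch of reading that policy file from user space. It assumes the kernel lists the available policies on one line with the active one in square brackets (e.g. "default performance [powersave]"), and it returns None where the file does not exist, as on the 5.5 nodes. The function names are made up for illustration.

```python
from pathlib import Path

# Path taken from the message above; the file is absent on kernels without ASPM support.
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text):
    """Return the bracketed (active) policy name, or None if none is marked.

    Assumes the format 'default performance [powersave]', where the
    active policy is the token wrapped in square brackets.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_aspm_policy(path=ASPM_POLICY_PATH):
    """Read the policy file; return None when the file does not exist."""
    try:
        return active_aspm_policy(path.read_text())
    except FileNotFoundError:
        return None


if __name__ == "__main__":
    # Reading needs no root; changing the policy does.
    print(read_aspm_policy())
```

Running this on one good node and one bad node would give a quick side-by-side of the two settings.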
> Bigger view ... have you isolated a CPU for IB handling, so at 7 cores,
> your machine is full (1 for IB and 7 for apps), but at 8 cores you are
> contending for resources (8 for apps + 1 for IB)?
> Are you running the app with taskset (explicitly or implicitly)?
In the test we're running, there isn't any local processing outside of
the communication, really - each task, bound to its own core, is simply
sending messages, in a giant loop. While there are clearly 8 cores all
talking to 1 IB device, each one (I believe) mmaps its own range and
handles its own message processing, and furthermore this definitely worked
before, so it doesn't seem like a resource contention issue unless it's
something to do with mmap on the versions we're running. I did double
check that we're not having processes migrating between cores, though.
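
One way to run that kind of migration check from user space, as a rough sketch rather than what was actually used here: sample the "processor" field (field 39) of /proc/<pid>/stat a few times and see whether it changes, alongside the allowed CPU set from os.sched_getaffinity. The helper names are hypothetical.

```python
import os
import time


def last_cpu_from_stat(stat_line):
    """Return the CPU a task last ran on, from one /proc/<pid>/stat line."""
    # Field 2 (comm) is parenthesised and may itself contain spaces or
    # parentheses, so split after the LAST ')' before tokenising.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 (processor) is rest[36].
    return int(rest[36])


def sample_cpus(pid, samples=10, interval=0.1):
    """Collect the set of CPUs a process was last seen on over several samples."""
    seen = set()
    for _ in range(samples):
        with open(f"/proc/{pid}/stat") as f:
            seen.add(last_cpu_from_stat(f.read()))
        time.sleep(interval)
    return seen


def allowed_cpus(pid=0):
    """CPUs the scheduler may place the process on (Linux only)."""
    return sorted(os.sched_getaffinity(pid))


if __name__ == "__main__":
    pid = os.getpid()
    print("allowed:", allowed_cpus(pid), "observed:", sorted(sample_cpus(pid, 3)))
```

A properly bound rank should show a single allowed CPU and a single observed CPU; more than one observed CPU means the task moved between samples.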
Mostly, I'm poking around kernel tunables right now and making a list of
things that might indicate the issue. I'll also take a deeper look at
/proc/interrupts during a run soon, too.
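
For the /proc/interrupts comparison, a small sketch that diffs two snapshots (one taken before a run, one after) and reports only the IRQ lines whose per-CPU counts changed. It assumes the usual layout: a header row of CPU names, then one row per interrupt with a label, per-CPU counts, and a description. The function names are made up for illustration.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {label: (per_cpu_counts, description)}."""
    lines = text.strip("\n").splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        # Some rows (e.g. ERR, MIS) carry fewer counts than there are CPUs,
        # so stop at the first non-numeric token.
        for tok in fields[:ncpu]:
            if not tok.isdigit():
                break
            counts.append(int(tok))
        table[label.strip()] = (counts, " ".join(fields[len(counts):]))
    return table


def interrupt_deltas(before, after):
    """Return {label: (per_cpu_delta, description)} for rows that changed."""
    old, new = parse_interrupts(before), parse_interrupts(after)
    deltas = {}
    for label, (counts_after, desc) in new.items():
        counts_before = old.get(label, ([0] * len(counts_after), ""))[0]
        diff = [a - b for a, b in zip(counts_after, counts_before)]
        if any(diff):
            deltas[label] = (diff, desc)
    return deltas


if __name__ == "__main__":
    import time
    with open("/proc/interrupts") as f:
        snap1 = f.read()
    time.sleep(1.0)  # stand-in for the benchmark run
    with open("/proc/interrupts") as f:
        snap2 = f.read()
    for label, (diff, desc) in interrupt_deltas(snap1, snap2).items():
        print(label, diff, desc)
```

If the IB completion interrupts all land on one core that is also running a rank, that would show up here as a single hot column.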