[Beowulf] bizarre scaling behavior on a Nehalem
Craig.Tierney at noaa.gov
Wed Aug 12 11:02:15 PDT 2009
Rahul Nabar wrote:
> On Wed, Aug 12, 2009 at 11:32 AM, Craig Tierney<Craig.Tierney at noaa.gov> wrote:
>> What do you mean normally? I am running Centos 5.3 with 2.6.18-128.2.1
>> right now on a 448 node Nehalem cluster. I am so far happy with how things work.
>> The original Centos 5.3 kernel, 2.6.18-128.1.10, had bugs in Nehalem support
>> where nodes would randomly start running slow. Upgrading the kernel
>> fixed that. But that performance problem was all or nothing; I don't recall
>> it exhibiting itself in the way that Rahul described.
> For me it shows:
> Linux version 2.6.18-128.el5 (mockbuild at builder10.centos.org)
> I am a bit confused with the numbering scheme, now. Is this older or
> newer than Craig's? You are right Craig, I haven't noticed any random
> slowdowns but my data is statistically sparse. I only have a single
> Nehalem+CentOS test node right now.
When you run uname -a, don't you get something like:
[ctierney at wfe7 serial]$ uname -a
Linux wfe7 2.6.18-128.2.1.el5 #1 SMP Thu Aug 6 02:00:18 GMT 2009 x86_64 x86_64 x86_64 GNU/Linux
We did build our kernel from source, but only because we ripped out
the in-tree IB support so we could build against the latest OFED stack.
You can also run:
# rpm -qa | grep kernel
and see what versions are listed.
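As a quick sanity check (a sketch; assumes an RPM-based system like CentOS), you can compare the running kernel against whatever kernel packages are installed, since a node can have several kernels installed but boot an older one:

```shell
# Show the kernel actually running on this node.
running=$(uname -r)            # e.g. 2.6.18-128.2.1.el5
echo "running kernel: $running"

# List all installed kernel packages, if rpm is available,
# so you can spot a newer installed kernel that isn't booted.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa 'kernel*' | sort
fi
```

If the newest installed kernel doesn't match `uname -r`, the node is still booting the older (buggy) one.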
We have found a few performance problems so far.
1) Nodes would start going slow, really slow. Once they started
to go slow they stayed slow, and the problem was only cleared by a reboot. This
problem was resolved by upgrading to the kernel we run now.
2) Nodes were reporting too many System Events that look like single-bit
errors. This again showed up as nodes that would start to go slow and
wouldn't recover until a reboot. We no longer think we had lots of
bad memory, and the latest BIOS may have fixed it. We are uploading that BIOS
now and will start checking.
The only time I got variability in timings was when I wasn't pinning
processes and memory correctly. My tests have always used all the cores
in a node, though. I think that OpenMPI does the correct thing
with mpi_paffinity_alone set. For mvapich, we wrote a wrapper script (similar to
TACC's) that uses numactl directly to pin memory and threads.
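The kind of wrapper described above can be sketched roughly like this (an assumption-laden sketch, not our actual script: the rank environment variable name is mvapich2's, and the cores-per-socket count is for a dual-socket Nehalem EP node):

```shell
#!/bin/sh
# Hypothetical pinning wrapper: map each local MPI rank to the NUMA
# node (socket) that holds its core, then bind CPU and memory there
# so allocations stay local to the socket.
RANK=${MV2_COMM_WORLD_LOCAL_RANK:-0}   # rank var name assumed (mvapich2)
CORES_PER_SOCKET=4                     # Nehalem EP: 4 cores/socket (assumption)
NODE=$((RANK / CORES_PER_SOCKET))      # socket index = NUMA node index

if [ $# -eq 0 ]; then
    # No command given: just report the binding we would apply.
    echo "rank $RANK -> node $NODE"
else
    # Bind both CPU and memory to the rank's socket, then run the binary.
    exec numactl --cpunodebind="$NODE" --membind="$NODE" "$@"
fi
```

You would launch it as, e.g., `mpirun ... ./wrapper.sh ./a.out`; the key point is binding memory (--membind) along with the CPU, since a thread migrating away from its memory's socket is exactly what causes the timing variability.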
Craig Tierney (craig.tierney at noaa.gov)