[Beowulf] Odd Infiniband scaling behaviour

Chris Samuel csamuel at vpac.org
Sun Oct 7 22:24:50 PDT 2007

Hi fellow Beowulfers..

We're currently building an Opteron based IB cluster, and are seeing
some rather peculiar behaviour that has had us puzzled for a while.

If I take a CPU-bound application like NAMD, I can run an 8 CPU job
on a single node and it pegs the CPUs at 100% (NAMD here is built
using Charm++ configured as an MPI system, with MVAPICH 0.9.8p3 and
the Portland Group compilers).

If I then run 2 x 4 CPU jobs of the *same* problem, they all run at
50% CPU.

If I run 4 x 2 CPU jobs, again the same problem, they run at 25%..

..and yes, if I run 8 x 1 CPU jobs they run at around 12-13% CPU!
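For what it's worth, those figures line up almost exactly with an even
100%/n split across n concurrent jobs, which is part of why we suspect
a single shared resource rather than the scheduler. A quick sketch of
the arithmetic:

```shell
# Per-job CPU share if n concurrent jobs contend for one shared
# resource: each job gets roughly 100/n percent -- which matches the
# 100% / 50% / 25% / ~12.5% figures above.
share() { awk -v n="$1" 'BEGIN { printf "%.1f", 100 / n }'; }

for n in 1 2 4 8; do
    echo "$n x $((8 / n))-CPU jobs -> ~$(share "$n")% CPU each"
done
```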

I then replicated the same problem with the example MPI cpi.c program,
to rule out some odd behaviour in NAMD.

What really surprised me was that when I tested cpi built with
OpenMPI (which doesn't use IB on our system), the problem vanished
and I could run 8 x 1 CPU jobs, each using 100% CPU!

So (at the moment) it looks like we're seeing some form of contention
on the Infiniband adapter..

07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev a0)
        Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
        Flags: fast devsel, IRQ 19
        Memory at feb00000 (64-bit, non-prefetchable) [size=1M]
        Memory at fd800000 (64-bit, prefetchable) [size=8M]
        Capabilities: [40] Power Management version 2
        Capabilities: [48] Vital Product Data
        Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
        Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
        Capabilities: [60] Express Endpoint IRQ 0

We see this problem with the standard CentOS kernel, with the latest
stable kernel, and with 2.6.23-rc9-git5 (which completely rips out
the old CPU scheduler and replaces it with Ingo Molnar's CFS).

This is on a SuperMicro based system with AMD's quad-core Barcelona
CPUs (1.9GHz), but I see the same behaviour (scaled down) on
dual-core Opterons too.

I've looked at what "modinfo ib_mthca" says are the tuneable options,
but the few I've played with ("msi_x" and "tune_pci") haven't made
any noticeable difference, sadly..
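In case anyone wants to experiment along the same lines: the usual way
we've been setting those parameters persistently is via modprobe.conf
(the values below are just examples of the knobs, not a known fix):

```
# /etc/modprobe.conf fragment -- example values only, not a fix:
options ib_mthca msi_x=1 tune_pci=1
```

After editing, the module has to be reloaded (rmmod ib_mthca &&
modprobe ib_mthca), which of course drops IB traffic on that node;
"modinfo -p ib_mthca" lists the full set of tuneable parameters.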

Has anyone else run into this or got any clues they could pass on, please?

Christopher Samuel - (03) 9925 4751 - Systems Manager
 The Victorian Partnership for Advanced Computing
 P.O. Box 201, Carlton South, VIC 3053, Australia
VPAC is a not-for-profit Registered Research Agency
