[Beowulf] Odd Infiniband scaling behaviour

Kevin Ball kevin.ball at qlogic.com
Mon Oct 8 10:18:42 PDT 2007

Hi Chris,

  I'm not an expert on the Mellanox IB implementation or MVAPICH, and I
won't try to be.  Gilead or someone else from Mellanox might be able to
give you more specific information, or maybe the OSU guys if you email
them.

  I have two guesses, based on things I've seen on a range of networks.

  Guess 1)

  The numbers you cite look remarkably as though you are running on
8-core nodes with 8, 4, 2, and 1 cores active, and citing the %
utilization of the entire node.  If you run top without separating the
display into per-CPU load (by pressing '1' while top is running), some
OSes (I think I've seen some variation in this behaviour) will report
utilization as a percentage of the total available CPU.  Thus one busy
CPU on an 8-core node would show up as 12.5%.

  I'm not sure this is what you're seeing, especially since you're not
seeing it with OpenMPI, but if the OpenMPI implementation you're using
uses threads for various purposes, that might explain it.

  Guess 2)

  Particularly with network devices that offload communication from the
CPU, an MPI implementation that uses an interrupt-driven communication
approach can sit idle for long stretches while waiting for data to
arrive.  An implementation that polls for data will not show this idle
time, so you can see dramatic differences in CPU utilization even
though, as far as the job at hand is concerned, the same amount of
progress is being made.

  I think that the default configuration of MVAPICH does poll for data,
so you would not see lots of idle CPU, but MVAPICH is configurable to
the moon and back, so I have no idea how you have it built.

  Hope this helps!

On Sun, 2007-10-07 at 22:24, Chris Samuel wrote:
> Hi fellow Beowulfers..
> We're currently building an Opteron based IB cluster, and are seeing
> some rather peculiar behaviour that has had us puzzled for a while.
> If I take a CPU bound application, like NAMD, I can run an 8 CPU job
> on a single node and it pegs the CPUs at 100% (this is built using
> Charm++ configured as an MPI system and using MVAPICH 0.9.8p3
> with the Portland Group Compilers).
> If I then run 2 x 4 CPU jobs of the *same* problem, they all run at
> 50% CPU.
> If I run 4 x 2 CPU jobs, again the same problem, they run at 25%..
> ..and yes, if I run 8 x 1 CPU jobs they run at around 12-13% CPU!
> I then replicated the same problem with the example MPI cpi.c program,
> to rule out some odd behaviour in NAMD.
> What really surprised me was when testing CPI built using OpenMPI
> (which doesn't use IB on our system) the problem vanished and I could
> run 8 x 1 CPU jobs, each using 100%!
> So (at the moment) it looks like we're seeing some form of contention
> on the Infiniband adapter..
> 07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev a0)
>         Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
>         Flags: fast devsel, IRQ 19
>         Memory at feb00000 (64-bit, non-prefetchable) [size=1M]
>         Memory at fd800000 (64-bit, prefetchable) [size=8M]
>         Capabilities: [40] Power Management version 2
>         Capabilities: [48] Vital Product Data
>         Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
>         Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>         Capabilities: [60] Express Endpoint IRQ 0
> We see this problem with the standard CentOS kernel, with the latest
> stable kernel, and with 2.6.23-rc9-git5 (which completely rips out
> and replaces the CPU scheduler with Ingo Molnar's CFS).
> This is on a SuperMicro based system with AMD's Barcelona quad
> core CPU (1.9GHz), but I see the same behaviour (scaled down) on dual
> core Opterons too.
> I've looked at what "modinfo ib_mthca" says are the tuneable options,
> but the few I've played with ("msi_x" and "tune_pci") haven't made
> any noticeable difference, sadly..
> Has anyone else run into this or got any clues they could pass on, please?
> cheers,
> Chris
