[Beowulf] Odd Infiniband scaling behaviour
Tom Elken tom.elken at qlogic.com
Mon Oct 8 09:32:55 PDT 2007
- Previous message: [Beowulf] Odd Infiniband scaling behaviour
- Next message: [Beowulf] Odd Infiniband scaling behaviour
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> -----Original Message-----
> [mailto:beowulf-bounces at beowulf.org] On Behalf Of Chris Samuel
> Sent: Sunday, October 07, 2007 10:25 PM
> To: beowulf at beowulf.org
> Subject: [Beowulf] Odd Infiniband scaling behaviour
>
> Hi fellow Beowulfers..
>
> We're currently building an Opteron based IB cluster, and are
> seeing some rather peculiar behaviour that has had us puzzled
> for a while.

To give us more info about your "scaling" problem, can you tell us
1) the elapsed run-time of the four scenarios you mention (or relative run-times)?
2) how you measured the CPU usage?

Thanks,
Tom

>
> If I take a CPU bound application, like NAMD, I can run an 8 CPU job
> on a single node and it pegs the CPUs at 100% (this is built using
> Charm++ configured as an MPI system and using MVAPICH 0.9.8p3
> with the Portland Group Compilers).
>
> If I then run 2 x 4 CPU jobs of the *same* problem, they all
> run at 50% CPU.
>
> If I run 4 x 2 CPU jobs, again the same problem, they run at 25%..
>
> ..and yes, if I run 8 x 1 CPU jobs they run at around 12-13% CPU!
>
> I then replicated the same problem with the example MPI cpi.c
> program, to rule out some odd behaviour in NAMD.
>
> What really surprised me was when testing CPI built using
> OpenMPI (which doesn't use IB on our system) the problem
> vanished and I could run 8 x 1 CPU jobs, each using 100%!
>
> So (at the moment) it looks like we're seeing some form of
> contention on the Infiniband adapter..
>
> 07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] (rev a0)
>         Subsystem: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]
>         Flags: fast devsel, IRQ 19
>         Memory at feb00000 (64-bit, non-prefetchable) [size=1M]
>         Memory at fd800000 (64-bit, prefetchable) [size=8M]
>         Capabilities: [40] Power Management version 2
>         Capabilities: [48] Vital Product Data
>         Capabilities: [90] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
>         Capabilities: [84] MSI-X: Enable- Mask- TabSize=32
>         Capabilities: [60] Express Endpoint IRQ 0
>
> We see this problem with the standard CentOS kernel, with the
> latest stable kernel (2.6.22.9) and with 2.6.23-rc9-git5
> (which completely rips out and replaced the CPU scheduler
> with Ingo Molnar's CFS).
>
> This is on a SuperMicro based system with AMD's Barcelona
> quad core CPU (1.9GHz), but I see the same behaviour (scaled
> down) on dual core Opterons too.
>
> I've looked at what "modinfo ib_mthca" says are the tuneable
> options, but the few I've played with ("msi_x" and
> "tune_pci") haven't made any noticeable difference, sadly..
>
> Has anyone else run into this or got any clues they could
> pass on please ?
>
> cheers,
> Chris
> --
> Christopher Samuel - (03) 9925 4751 - Systems Manager
> The Victorian Partnership for Advanced Computing
> P.O. Box 201, Carlton South, VIC 3053, Australia
> VPAC is a not-for-profit Registered Research Agency
>
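
On question 2 above, one way the per-process CPU figure could be measured without relying on top is sketched below. It is not from the original post and only assumes a Linux /proc filesystem: it reads utime and stime (fields 14 and 15 of /proc/<pid>/stat) twice and converts the difference to a percentage of one core. The function names are made up for illustration. The allowed-CPU mask is reported as well, since processes confined to the same cores would also end up sharing CPU time; that is only one possible explanation and nothing in this thread establishes it.

```python
import os
import time


def cpu_ticks_from_stat(stat_line):
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces, so split
    # after the last ')'. The remainder starts at field 3 (state), which puts
    # utime (field 14) at index 11 and stime (field 15) at index 12.
    rest = stat_line.rsplit(")", 1)[1].split()
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, elapsed_seconds, ticks_per_second):
    """CPU use over the interval, as a percentage of one core."""
    cpu_seconds = (ticks_after - ticks_before) / ticks_per_second
    return 100.0 * cpu_seconds / elapsed_seconds


def sample_pid(pid, interval=1.0):
    """Sample one process; return (percent of one core, allowed CPU list)."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")

    def read_ticks():
        with open(f"/proc/{pid}/stat") as stat_file:
            return cpu_ticks_from_stat(stat_file.read())

    start = time.monotonic()
    before = read_ticks()
    time.sleep(interval)
    after = read_ticks()
    elapsed = time.monotonic() - start

    percent = cpu_percent(before, after, elapsed, ticks_per_second)
    # Linux-only call: the set of CPUs the scheduler may run this pid on.
    allowed_cpus = sorted(os.sched_getaffinity(pid))
    return percent, allowed_cpus
```

Calling sample_pid() on the PID of each MPI rank in the 8 x 1 case would give a number to set beside the 12-13% reported, together with the cores each rank is allowed to run on.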
