[Beowulf] InfiniBand VL15 error
Nifty Tom Mitchell
niftyompi at niftyegg.com
Tue Dec 2 16:04:47 PST 2008
On Tue, Dec 02, 2008 at 05:02:59PM -0500, Prentice Bisbal wrote:
> See my answers inline.
> Nifty Tom Mitchell wrote:
> > On Tue, Dec 02, 2008 at 10:24:15AM -0500, Prentice Bisbal wrote:
> >> I'm getting this error when I run ibchecknet on my cluster:
> >> #warn: counter VL15Dropped = 476 (threshold 100) lid 1 port 1
> >> Error check on lid 1 (aurora HCA-1) port 1: FAILED
> >> I've googled around this morning, but haven't found anything helpful.
> >> Most of the hits turn up code with the phrase "VL15Dropped", but nothing
> >> explaining what this error means, what causes it, or how to fix it.
> >> After clearing the counters with 'perfquery -r', the VL15Dropped count
> >> starts increasing from zero almost immediately.
> >> Any ideas what this error represents or how to fix? Could it be a bad
> >> cable?
> > Can you be specific about the hardware (HCA and switch) and software?
> > How large is the fabric?
> > What subnet manager is running and where?
> > The host behind LID-1 is the one of interest.
> IB Switch: Cisco 7012 D, 144-port
> HCAs: Cisco, which is really Mellanox:
> # lspci | grep Infini
> 0b:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex
> (Tavor compatibility mode) (rev 20)
> The subnet manager is OpenSM 3.1.8-1.el5, which is provided by my Linux
> Distro, PU_IAS 5.2, which is a rebuild of RHEL 5.2. It is running on the
> master node, aurora. The HCA with the error is on this node (see the error
> message in my original post).
> > If I recall correctly, VL15 is reserved exclusively for subnet management
> > and is not optional. Traffic to VL15 might be randomly dropped by the
> > switch, SMA or interrupt handler. As long as the subnet is OK modest
> > dropped traffic on VL15 may not be an issue.
> > What is running on the fabric concurrently with ibchecknet (and on the LID-1 host)?
> Not sure what you mean. Do you want to see the output of ibchecknet?
What I was thinking was that on a compute-bound system the subnet manager
process might not get enough time to service all the management packets.
In the Mellanox case, the card on a local node can handle many SMA actions
inside the card itself, but larger fabric-wide actions need interrupts and system time.
If this is an overloaded IO or compute node, the subnet manager may not
wake up often enough to handle all the management packets.
That is, it may be normal and OK to see VL15 drops with this load, card, software stack and SM.
Your Mellanox contact can help answer this...
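
To judge whether the drops are "modest", it helps to look at the rate rather than
the raw count. Here is a minimal sketch, assuming you save the ibchecknet output
twice, some seconds apart, and that the warning lines look like the one quoted at
the top of this thread. The function names are mine, not part of any IB tool, and
a counter only shows up in a warning once it is over the threshold, so a missing
counter is treated as zero here.

```python
import re

# Matches the warning line format quoted earlier in this thread, e.g.
# "#warn: counter VL15Dropped = 476 (threshold 100) lid 1 port 1"
WARN_RE = re.compile(
    r"#warn: counter (?P<name>\w+) = (?P<value>\d+) "
    r"\(threshold (?P<threshold>\d+)\) lid (?P<lid>\d+) port (?P<port>\d+)"
)


def parse_warnings(text):
    """Return a dict mapping (lid, port, counter name) to the counter value."""
    counters = {}
    for m in WARN_RE.finditer(text):
        key = (int(m["lid"]), int(m["port"]), m["name"])
        counters[key] = int(m["value"])
    return counters


def drop_rate(before, after, seconds, key=(1, 1, "VL15Dropped")):
    """Counter increase per second between two samples (hypothetical helper).

    A counter missing from a sample is assumed to be 0. A negative result
    means the counter was reset between the two samples.
    """
    delta = after.get(key, 0) - before.get(key, 0)
    return delta / seconds
```

If the rate follows the SM sweep interval or the load on the master node, that
points at the SM host being too busy rather than at a cable.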
> > Subnet management traffic should be light, very light. Tell us about
> > the subnet manager situation on your fabric. There should only
> > be one active subnet manager. Mixed and uncooperating SMs could
> > cause this, as could basic IB errors (connectors, cables, connections).
> > If the SM is running on LID-1 then traffic will reflect the fabric size.
> There is only one SM running. It's running on the master node. The other
> nodes don't even have the OpenSM package installed.
> > What other IB errors are you seeing? If the port for LID-1 is not seeing
> > IB errors other than VL15 you should be OK -- do look for multiple SMs.
> I'm not seeing any other errors. This one is a new development, too.
> > If you stop your subnet manager, does the counter reflect the pause?
> Haven't tried yet. And since it's almost quitting time, I'm not going to
> try until tomorrow.
Pausing the subnet manager can be diagnostic.
If you pause/stop the SM and reboot a free node, the free node will
not be assigned a LID. If you have another SM on the fabric, it will
get a LID. While multiple subnet managers are legal, the interactions
between different versions have too many permutations for good test
coverage. It can be good to 'test' for unexpected subnet managers.
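
The LID test above can be scripted. This is only a sketch under a few assumptions:
that you capture the port status text on the free node after it was rebooted with
the known SM stopped, that the text carries a "Base lid:" field the way I recall
ibstat printing it, and that an unassigned port reports LID 0. The function names
are hypothetical.

```python
import re

# Assumed field label, as I recall it from ibstat-style output.
# Adjust the pattern if your tool prints the LID differently.
BASE_LID_RE = re.compile(r"Base lid:\s*(\d+)", re.IGNORECASE)


def assigned_lids(port_status_text):
    """Return every base LID found in the captured port status text."""
    return [int(lid) for lid in BASE_LID_RE.findall(port_status_text)]


def unexpected_sm_suspected(port_status_text):
    """True if any port holds a nonzero LID.

    Only meaningful when the text was captured on a node that was rebooted
    while the known SM was stopped: a nonzero LID then means some other
    subnet manager assigned it.
    """
    return any(lid != 0 for lid in assigned_lids(port_status_text))
```

Run it against a node that was NOT rebooted and it will report True, since ports
keep the LID they already had, so the reboot step matters.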
Do revisit your OpenSM settings. Sweeps for node status may just be
There is a chance that your opensm is dated.
It looks like:
opensm-3.1.8-1.el5.x86_64.rpm. Build Date: Mon Mar 17 14:12:13 2008
Inspect the change log dates ;-0
The current OFED version looks like:
While OFED and rpm versions do not always track each other, consider an update. Also note that RH is
slow to pick up many OFED changes, as the OFED process is a big-bang release process.
Others on this list might know if the delta from 3.1.8 to 3.1.11 is important in this regard.
T o m M i t c h e l l
Found me a new hat, now what?