[Beowulf] InfiniBand VL15 error
prentice at ias.edu
Tue Dec 2 14:02:59 PST 2008
See my answers inline.
Nifty Tom Mitchell wrote:
> On Tue, Dec 02, 2008 at 10:24:15AM -0500, Prentice Bisbal wrote:
>> I'm getting this error when I run ibchecknet on my cluster:
>> #warn: counter VL15Dropped = 476 (threshold 100) lid 1 port 1
>> Error check on lid 1 (aurora HCA-1) port 1: FAILED
>> I've googled around this morning, but haven't found anything helpful.
>> Most of the hits turn up code with the phrase "VL15Dropped", but nothing
>> explaining what this error means, what causes it, or how to fix it.
>> After clearing the counters with 'perfquery -r', the VL15Dropped count
>> starts increasing from zero almost immediately.
>> Any ideas what this error represents or how to fix? Could it be a bad
> Can you be specific about the hardware (HCA and switch) and software?
> How large is the fabric?
> What subnet manager is running and where?
> The host behind LID-1 is the one of interest.
IB Switch: Cisco 7012 D, 144-port
HCAs: Cisco, which is really Mellanox:
# lspci | grep Infini
0b:00.0 InfiniBand: Mellanox Technologies MT25208 InfiniHost III Ex
(Tavor compatibility mode) (rev 20)
The subnet manager is OpenSM 3.1.8-1.el5, which is provided by my Linux
distro, PU_IAS 5.2, a rebuild of RHEL 5.2. It is running on the
master node, aurora. The HCA with the error is on this node (see the error
message in my original post).
> If I recall correctly, VL15 is reserved exclusively for subnet management
> and is not optional. Traffic to VL15 might be randomly dropped by the
> switch, SMA or interrupt handler. As long as the subnet is OK, modest
> dropped traffic on VL15 may not be an issue.
> What is running on the fabric concurrently with ibchecknet (and on the LID-1 host)?
Not sure what you mean. Do you want to see the output of ibchecknet?
> Subnet management traffic should be light, very light. Tell us about
> the subnet manager situation on your fabric. There should only
> be one active subnet manager. Mixed and uncooperating SMs could
> cause this, as could basic IB errors (connectors, cables, connections).
> If the SM is running on LID-1 then traffic will reflect the fabric size.
There is only one SM running. It's running on the master node. The other
nodes don't even have the OpenSM package installed.
> What other IB errors are you seeing? If the port for LID-1 is not seeing
> IB errors other than VL15, you should be OK -- do look for multiple SMs.
I'm not seeing any other errors. This one is a new development, too.
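One way to confirm that VL15Dropped is the only counter over threshold is
to filter the warning lines out of the ibchecknet output. Below is a minimal
Python sketch. It assumes every warning follows the format of the single
line quoted from ibchecknet earlier in this thread; other versions may
format warnings differently. The function name parse_warnings and the
sample text are made up for illustration.

```python
import re

# Pattern modeled on the single warning line quoted in this thread;
# other ibchecknet versions may format warnings differently (assumption).
WARN_RE = re.compile(
    r"#warn: counter (?P<counter>\w+) = (?P<value>\d+) "
    r"\(threshold (?P<threshold>\d+)\) lid (?P<lid>\d+) port (?P<port>\d+)"
)


def parse_warnings(text):
    """Return one dict per counter warning found in ibchecknet-style output."""
    warnings = []
    for line in text.splitlines():
        match = WARN_RE.search(line)
        if match:
            entry = {"counter": match.group("counter")}
            for key in ("value", "threshold", "lid", "port"):
                entry[key] = int(match.group(key))
            warnings.append(entry)
    return warnings


if __name__ == "__main__":
    # Sample text copied from the output quoted in the original post.
    sample = (
        "#warn: counter VL15Dropped = 476 (threshold 100) lid 1 port 1\n"
        "Error check on lid 1 (aurora HCA-1) port 1: FAILED\n"
    )
    for w in parse_warnings(sample):
        print(w["counter"], w["value"], "lid", w["lid"], "port", w["port"])
```

If the only counter name that comes back is VL15Dropped, that would match
what I am seeing by eye.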
> If you stop your subnet manager, does the counter reflect the pause?
Haven't tried yet. And since it's almost quitting time, I'm not going to
try until tomorrow.
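
The stop-the-SM test comes down to comparing how fast VL15Dropped grows
with OpenSM running versus stopped: take two readings a known interval
apart in each state and compare drops per second. A minimal Python sketch
of that arithmetic follows. The function name drops_per_second is
hypothetical, and the numbers in the example are placeholders, not
measurements from this fabric.

```python
def drops_per_second(count_start, count_end, seconds):
    """Average rate of increase of a counter between two readings."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    if count_end < count_start:
        # The counter was cleared (e.g. with 'perfquery -r') between readings.
        raise ValueError("counter went backwards; was it reset?")
    return (count_end - count_start) / seconds


if __name__ == "__main__":
    # Placeholder readings, not real data from this cluster.
    with_sm = drops_per_second(0, 120, 60)
    without_sm = drops_per_second(0, 0, 60)
    print(f"SM running: {with_sm:.2f}/s, SM stopped: {without_sm:.2f}/s")
```

A rate that falls to zero while the SM is stopped would point at subnet
management traffic as the source of the drops.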