[Beowulf] InfiniBand VL15 error

Nifty Tom Mitchell niftyompi at niftyegg.com
Tue Dec 2 13:24:14 PST 2008


On Tue, Dec 02, 2008 at 10:24:15AM -0500, Prentice Bisbal wrote:
> 
> I'm getting this error when I run ibchecknet on my cluster:
> 
> #warn: counter VL15Dropped = 476        (threshold 100) lid 1 port 1
> Error check on lid 1 (aurora HCA-1) port 1:  FAILED
> 
> I've googled around this morning, but haven't found anything helpful.
> Most of the hits turn up code with the phrase "VL15Dropped", but nothing
> explaining what this error means, what causes it, or how to fix it.
> 
> After clearing the counters with 'perfquery -r', the VL15Dropped count
> starts increasing from zero almost immediately.
> 
> Any ideas what this error represents or how to fix? Could it be a bad
> cable?
> 

Can you be specific about the hardware (HCA and switch) and software?
How large is the fabric?
What subnet manager is running and where?

The host behind LID-1 is the one of interest.

If I recall correctly, VL15 is reserved exclusively for subnet management
and is not optional.  VL15 is not flow controlled, so its traffic can be
dropped by the switch, SMA or interrupt handler.  As long as the subnet
is OK, a modest amount of dropped traffic on VL15 may not be an issue.

What is running on the fabric concurrently with ibchecknet (and on the LID-1 host)?

Subnet management traffic should be light, very light.  Tell us about
the subnet manager situation on your fabric.  There should be only
one active subnet manager.  Mixed and uncooperative SMs could
cause this, as could basic IB errors (connectors, cables, connections).
If the SM is running on the LID-1 host, its VL15 traffic will scale
with the fabric size.

What other IB errors are you seeing?  If the port for LID-1 is not seeing
IB errors other than VL15 you should be OK, but do look for multiple SMs.
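
To check quickly whether VL15Dropped is the only counter being flagged
on a port, something like the rough Python sketch below could be run
over saved ibchecknet output.  It assumes every warning line has the
same shape as the single line you quoted; I have not checked that
against other counters, and the function names are just mine.

```python
import re
from collections import defaultdict

# Assumed line shape, copied from the one warning quoted in this thread:
#   #warn: counter VL15Dropped = 476        (threshold 100) lid 1 port 1
WARN_RE = re.compile(
    r"#warn: counter (?P<name>\w+) = (?P<value>\d+)\s+"
    r"\(threshold (?P<threshold>\d+)\) lid (?P<lid>\d+) port (?P<port>\d+)"
)


def parse_warnings(text):
    """Group flagged counters by (lid, port) as {name: (value, threshold)}."""
    by_port = defaultdict(dict)
    for line in text.splitlines():
        m = WARN_RE.search(line)  # search() tolerates a leading "> " quote mark
        if m:
            key = (int(m["lid"]), int(m["port"]))
            by_port[key][m["name"]] = (int(m["value"]), int(m["threshold"]))
    return dict(by_port)


def only_vl15(counters):
    """True when VL15Dropped is the only counter flagged for a port."""
    return set(counters) == {"VL15Dropped"}


sample = "#warn: counter VL15Dropped = 476        (threshold 100) lid 1 port 1"
for (lid, port), counters in parse_warnings(sample).items():
    print(lid, port, counters, "VL15 only:", only_vl15(counters))
```

A port that reports only VL15Dropped points toward the SM questions
above; a port that also reports other counters is a better candidate
for a cable or connector check.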

If you stop your subnet manager, does the counter reflect the pause?
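
One way to judge that is to note the VL15Dropped value at a few points
in time, before and after stopping the SM, and look at the rate rather
than the raw count.  A minimal sketch, assuming you record
(seconds, count) pairs by hand or from a script; the sample numbers
below are invented purely for illustration and are not from any fabric.

```python
def drop_rates(samples):
    """Return drops per second between consecutive (seconds, count) samples.

    If the count goes down between two samples, the counter is assumed to
    have been reset in between and the interval is counted from zero.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("sample times must be strictly increasing")
        if c1 < c0:
            c0 = 0  # assumed counter reset between the two samples
        rates.append((c1 - c0) / (t1 - t0))
    return rates


# Made-up readings: SM running for the first 20 seconds, then stopped.
readings = [(0, 0), (10, 40), (20, 80), (30, 80), (40, 80)]
print(drop_rates(readings))  # [4.0, 4.0, 0.0, 0.0]
```

If the rate falls to zero while the SM is stopped and picks up again
when it restarts, the drops are tied to SM traffic rather than to
something else on the fabric.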


-- 
	T o m  M i t c h e l l 
	Found me a new hat, now what?



