[Beowulf] Re: Bugfix for Broadcom NICs losing connectivity
Tina.Friedrich at diamond.ac.uk
Fri Jun 4 01:39:38 PDT 2010
We've had that happen on some of our servers. Currently using the
disable_msi workaround, which seems to have stopped it. I believe
there's supposed to be a fix in the latest Red Hat kernel but we haven't
really tested that yet.
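
For anyone wanting to check whether the workaround actually took effect, here is a rough sketch rather than anything official. It assumes the workaround is the bnx2 module option (commonly set with a line like "options bnx2 disable_msi=1" in /etc/modprobe.conf, followed by a driver reload or reboot), and that /proc/interrupts labels MSI interrupts with a string containing "MSI"; the exact label differs between kernels. The function name uses_msi is made up for illustration.

```python
def uses_msi(interrupts_text, device="eth0"):
    """Guess from /proc/interrupts content whether `device` is using MSI.

    Returns True or False for the first line that names the device,
    or None if the device does not appear at all.
    """
    for line in interrupts_text.splitlines():
        # Devices sharing an IRQ are comma-separated, e.g. "eth0, uhci_hcd:usb1".
        fields = [field.rstrip(",") for field in line.split()]
        if device in fields:
            # Assumption: MSI interrupts carry a label such as "PCI-MSI"
            # or "PCI-MSI-edge"; legacy ones show "IO-APIC-..." instead.
            return any("MSI" in field for field in fields if field != device)
    return None
```

On a live node one would pass it the file contents, e.g. uses_msi(open("/proc/interrupts").read(), "eth0"); a False result after reloading the driver suggests the option was picked up.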
You lose all network connectivity (including IPMI) to the server, but not
all connectivity: a serial console (not SOL; a proper serial console, or
one reached through a console server) still works, as would a locally
attached keyboard/monitor. Unless you require network to log in :) . If
one runs into this, it's a really weird one (before you find the bug
report) - to all appearances, the server works happily, no strangeness
in the logs - just network gone completely.
It's not one that triggers easily, so it's hard to track down. We had
610s and 710s for a while before this first happened (and there are still
loads we have never seen it on). We first saw it on a rather heavily used
NFS server (i.e. lots of network I/O).
Cris Rhea wrote:
>> In case it helps anyone using Dell R410 / 610 / 710 etc. servers: I have had
>> machines lose their eth connections periodically (CentOS 5.4 bnx2 driver).
>> Seems like a bug with the Broadcom NIC drivers. [luckily read of it on a
>> Dell mailing list]
>> Bug Reports:
>> Not sure yet if this is exactly my issue but I'm giving it a shot now.
>> Thought I'd post since, anecdotally I've seen many people use these servers
>> on the list.
> I've been following this on the Dell list as I have approx. 50 R410s
> in our cluster.
> One thing that isn't clear-- When this happens, do you lose all
> connectivity to the node (i.e., do you have to reboot the node to
> re-establish eth0)?
> My R410s are running CentOS 5.2 - 5.4 and I rarely have one go
> --- Cris
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442