[Beowulf] Re: Bugfix for Broadcom NICs losing connectivity
Tina Friedrich Tina.Friedrich at diamond.ac.uk
Fri Jun 4 01:39:38 PDT 2010
- Previous message: [Beowulf] recommendations for parallel IO
- Next message: [Beowulf] Re: Bugfix for Broadcom NICs losing connectivity
We've had that happen on some of our servers. We are currently using the disable_msi workaround, which seems to have stopped it. I believe there's supposed to be a fix in the latest Red Hat kernel, but we haven't really tested that yet.

You lose all network connectivity to the server (including IPMI), but not all connectivity: a serial console (not SOL; a proper serial console, or one reached via a console server) still works, as would a locally attached keyboard/monitor. Unless you require network to log in :) .

If one runs into this, it's a really weird one (before you find the bug report) - to all appearances, the server works happily, with no strangeness in the logs - just the network gone completely.

It's not easy to trigger, so it's the hard-to-track-down sort of thing. We had 610s and 710s for a while before this first happened (and there are still loads we have never seen it on). We first saw it on a rather heavily used NFS server (i.e. lots of network I/O).

Tina

Cris Rhea wrote:
>> In case it helps anyone using Dell R410 / 610 / 710 etc. servers: I have had
>> machines lose their eth connections periodically (CentOS 5.4 bnx2 driver).
>> Seems like a bug with the Broadcom NIC drivers. [luckily read of it on a
>> Dell mailing list]
>>
>> Bug Reports:
>>
>> http://kbase.redhat.com/faq/docs/DOC-26837
>> http://patchwork.ozlabs.org/patch/51106
>>
>> Not sure yet if this is exactly my issue but I'm giving it a shot now.
>> Thought I'd post since, anecdotally I've seen many people use these servers
>> on the list.
>>
>> --
>> Rahul
>
> I've been following this on the Dell list as I have approx. 50 R410s
> in our cluster.
>
> One thing that isn't clear-- When this happens, do you lose all
> connectivity to the node (i.e., do you have to reboot the node to
> re-establish eth0)?
>
> My R410s are running CentOS 5.2 - 5.4 and I rarely have one go
> down.
>
> --- Cris
>
>

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
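
The disable_msi workaround mentioned in the message is, as far as I know, normally applied as a bnx2 module option (something like `options bnx2 disable_msi=1` in the modprobe configuration, followed by a driver reload or reboot); treat that exact line as an assumption and check it against the linked bug reports. One way to confirm whether a NIC is still using MSI afterwards is to look at its row in `/proc/interrupts`. The following is a minimal Python sketch of that check. The function name `uses_msi` and the `SAMPLE` text are hypothetical and for illustration only; the sample is not output from any of the machines discussed here, and the exact controller labels vary between kernel versions.

```python
def uses_msi(interrupts_text, iface):
    """Report whether iface's IRQ row in /proc/interrupts-style text names an MSI controller.

    Returns True or False for the first matching row, or None if iface is not listed.
    """
    for line in interrupts_text.splitlines():
        head, sep, rest = line.partition(":")
        # Only rows that start with a numeric IRQ are of interest; this skips
        # the CPU header line and summary rows such as NMI or LOC.
        if not sep or not head.strip().isdigit():
            continue
        fields = rest.split()
        # Skip the per-CPU interrupt counters; the first non-numeric field
        # is the interrupt controller label (e.g. PCI-MSI, IO-APIC-level).
        idx = 0
        while idx < len(fields) and fields[idx].isdigit():
            idx += 1
        if idx >= len(fields):
            continue
        controller = fields[idx]
        # Remaining fields are device names, possibly comma-separated when an
        # IRQ is shared, or suffixed (eth0-0, eth0-1) for multi-vector setups.
        devices = [d.strip(",") for d in fields[idx + 1:]]
        if any(d == iface or d.startswith(iface + "-") for d in devices):
            return "MSI" in controller
    return None


# Hypothetical sample in the general shape of /proc/interrupts (assumption).
SAMPLE = """\
           CPU0       CPU1
  0:   12345678          0    IO-APIC-edge  timer
 16:        210          0   IO-APIC-level  uhci_hcd:usb3, eth2
 90:    4455667     112233         PCI-MSI  eth0
 98:       1200        340   IO-APIC-level  eth1
NMI:          0          0
"""
```

On a real host one would pass the contents of `/proc/interrupts` instead of `SAMPLE`, for example `uses_msi(open("/proc/interrupts").read(), "eth0")`. If the workaround is in effect, the expectation is that the result is False for the affected interface.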
