[Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK
Greg at keller.net
Thu Dec 3 12:17:56 PST 2009
>>> What's got me and the IT guys stumped is that while the compute
>>> nodes boot via PXE from the head node without trouble on the
>>> NetGear, they barf with the SMC. To be specific, after the initial
>>> boot with a minimal Linux kernel, there is a "fatal error" with
>>> "timeout waiting for getfile" when the compute node attempts to
>>> download the provisioning image from head. However, when they were
>>> running Rocks before I arrived, the cluster worked fine with the
>>> SMC switch.
This is very common with spanning tree enabled. Essentially, once the
port has a physical link light, it may take a while (typically about
30 seconds with default 802.1D timers) before spanning tree allows
traffic to actually flow through the port. That is longer than a
typical boot or download timeout. Loading or reloading the NIC driver
seems to drop the link momentarily, which forces a new delay cycle.
With the Dell PowerConnect (SMC rebrand??) series you have to enable
port fast or disable spanning tree to avoid this delay before traffic
passes. I generally do both. The web-based GUI is bad enough to make
this more difficult than it needs to be, but you can globally disable
spanning tree through it. I use the command line, select all ports as
an interface range, and then configure the ports from there:

    interface range ethernet all
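
For reference, a rough sketch of what "doing both" might look like from
the switch command line. This is an assumption based on the
PowerConnect-style CLI, not a transcript from a real switch: the
"console" prompts are illustrative, and the exact command names can
differ by model and firmware, so check the CLI reference for your
switch before using it.

```
console# configure
console(config)# interface range ethernet all
! assumed per-port command: skip the listening/learning delay
console(config-if)# spanning-tree portfast
console(config-if)# exit
! assumed global command: turn spanning tree off entirely
console(config)# no spanning-tree
console(config)# exit
```

Port fast alone is usually enough for ports that only connect compute
nodes. Disabling spanning tree globally also removes loop protection,
so it is only reasonable on a private cluster network with no
redundant switch links.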
Hope this helps!
R Systems NA, inc.