[Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK

Joe Landman landman at scalableinformatics.com
Wed Dec 2 10:30:02 PST 2009


Art Poon wrote:
> Dear colleagues,

[...]

> What's got me and the IT guys stumped is that while the compute nodes
> boot via PXE from the head node without trouble on the NetGear, they
> barf with the SMC.  To be specific, after the initial boot with a
> minimal Linux kernel, there is a "fatal error" with "timeout waiting
> for getfile" when the compute node attempts to download the
> provisioning image from head.  However, when they were running Rocks
> before I arrived, the cluster worked fine with the SMC switch.

Is it the switch or the dhcp/bootp/tftp setup that's the problem?  Are
you sure the tftp daemon is up, and that bootp is configured correctly?
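
One quick way to answer the tftp question from a machine plugged into
the same switch is to send a single read request and see whether
anything comes back.  What follows is only a minimal sketch: the packet
layout is the one from RFC 1350, but the function names are made up for
illustration, and the host and file name you pass in are whatever your
own setup serves.

```python
import socket
import struct

TFTP_PORT = 69  # standard TFTP port (RFC 1350)


def build_rrq(filename, mode="octet"):
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL."""
    return (struct.pack("!H", 1)
            + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def describe_reply(packet):
    """Summarise the first packet a TFTP server sends back."""
    if len(packet) < 4:
        return "short or malformed reply"
    opcode, value = struct.unpack("!HH", packet[:4])
    if opcode == 3:  # DATA: value is the block number
        return "DATA block %d, %d bytes" % (value, len(packet) - 4)
    if opcode == 5:  # ERROR: value is the error code, then a NUL-terminated message
        message = packet[4:].split(b"\0", 1)[0].decode("ascii", "replace")
        return "ERROR %d: %s" % (value, message)
    return "unexpected opcode %d" % opcode


def probe_tftp(host, filename, timeout=5.0, port=TFTP_PORT):
    """Send one read request; return a description of the reply, or None on timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (host, port))
        try:
            # The server answers from a fresh port, so the socket is left unconnected.
            packet, _ = sock.recvfrom(4 + 512)
        except socket.timeout:
            return None
    return describe_reply(packet)
```

Calling probe_tftp() with the head node's address and a file it is
supposed to serve returns a DATA or ERROR summary if the daemon is
alive (even "file not found" proves that much), and None if the request
or the reply is being lost somewhere.  A local firewall on the probing
machine can also swallow the reply, so treat None as a hint and not as
proof.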

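Switches sometimes have broadcast storm suppression turned on, or worse,
sometimes they have spanning tree turned on.  With the default timers,
spanning tree can keep a port from forwarding traffic for about 30
seconds after the link comes up, which is long enough for a boot-time
download to time out.  You want the switch to be as dumb as you can
possibly make it for most Linux clusters.  Fast, but dumb.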
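
If you want to see whether the switch is sitting on a port after
link-up, one rough test is to replug a node's cable and time how long
it takes before the head node becomes reachable again.  This is a
sketch under assumptions: the function name is hypothetical, and it
presumes the head node accepts TCP connections on some port you know
about (ssh on 22, say).

```python
import socket
import time


def seconds_until_reachable(host, port, deadline=60.0, interval=1.0):
    """Retry a TCP connect until it succeeds.

    Returns the elapsed time in seconds at the first successful connect,
    or None if nothing succeeded before the deadline.
    """
    start = time.monotonic()
    while time.monotonic() - start < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return time.monotonic() - start
        except OSError:
            # Refused, unreachable, or timed out: pause briefly and try again.
            time.sleep(interval)
    return None
```

A result of a second or two on the dumb switch against something close
to 30 seconds on the managed one would point toward spanning tree
delaying the port.  Similar numbers on both would suggest looking
elsewhere.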

> I've tried resetting the SMC switch to factory defaults (with
> auto-negotiate on).  I've checked the /etc/beowulf/modprobe.conf and
> it doesn't seem to be demanding anything exotic.  We've tried
> swapping out to another SMC switch but that didn't change anything.

This sounds more like a problem in the server software stack than in
the switch.  Could you describe that stack?  Are you using Scyld or
Rocks for provisioning?

Rocks is quite sensitive to configuration issues, and really doesn't 
like altered configurations (it is possible to do, though non-trivial).

-- 
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: landman at scalableinformatics.com
web  : http://scalableinformatics.com
        http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615


