Basic Scyld setup node boot problems

Donald Becker becker at scyld.com
Thu Sep 19 09:14:03 PDT 2002


On Wed, 18 Sep 2002, Mesimer, Daniel James (UMKC-Student) wrote:

> I solved my problem by using the beoboot utility to create my boot
> disks, instead of the BeoSetup Node boot disk utility. 
> 
> the command was simply:  beoboot -2 -f
> 
> I just don't understand why this made that big of a difference?

What you have done is put a "Stage 2" image on the floppy.

The Scyld system normally works with a two stage boot:
  Stage 1 is a stable, reliable kernel and network driver set that
     exists only to contact a "load master" and download stage 2
  Stage 2 is the final kernel that the compute node will be running

Stage 1 will never change, and may be written onto permanent boot media.
Stage 2 is the "latest and greatest", which might be unstable.

The reason for this approach is that Scyld clusters have single point
administration.  A two stage boot is the mechanism for single-point
updating of kernel and device drivers, while never risking the
bootability of the cluster.  There are other systems out there that claim
"single command update" is the same thing, but copying 100 broken
kernels or device drivers out to your compute nodes will ruin your whole
week.

Still, this approach doesn't always work.  There are various bugs and
hardware problems (usually BIOS bugs) that prevent subsystems, such as
Two Kernel Monte, from working.  So we provide the alternative of
booting directly into the final kernel.  The 'beoboot -2 -f' command
means "make a stage 2 Floppy boot image".

> I am using the Scyld Beowulf 27bz-8 edition.
..
> I, then, made the node boot floppies (again default).  These boot my
> slave nodes just fine.  They begin RARPing and I see all of the MAC
> addresses in the unknown field just fine.  When I move these addresses
> to the Configured Nodes Section (of BeoSetup), (I believe) the slave
> nodes then get the phase 2 kernel and begin running it.  
> 
> But, the slave nodes then stop.  
> Stating:
> VFS: Cannot open root device 03:05

You have a BIOS bug that reports the wrong amount of memory, and that
confuses the 'syslinux' subsystem. 
Upgrading your BIOS is the best fix to the problem.

More detail: the BIOS is reporting the wrong amount of free memory to
'syslinux', which results in the kernel boot command line being
corrupted.  We pass the information about the RAMdisk to mount on the
command line, and without that information the kernel tries to mount the
default device.  The default is /dev/hda5, the partition where Linus
Torvalds keeps his "/" filesystem.
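
As an illustration of how "03:05" in the error message maps to /dev/hda5,
here is a minimal Python sketch.  It assumes the two fields are hexadecimal
major:minor numbers, that major 3 is the first IDE controller, and that each
IDE drive owns 64 minor numbers (hda = 0-63, hdb = 64-127).  The function
name decode_ide_root is hypothetical and only exists for this example.

```python
# Decode the "major:minor" pair printed in the kernel's VFS error message.
# Assumptions (not from the original post): both fields are hexadecimal,
# major 3 is the first IDE controller, and each drive has 64 minors.

def decode_ide_root(dev):
    """Return the /dev name for an IDE 'major:minor' string such as '03:05'."""
    major_text, minor_text = dev.split(":")
    major = int(major_text, 16)
    minor = int(minor_text, 16)
    if major != 3:
        raise ValueError("only IDE major 3 is handled in this sketch")
    if minor > 127:
        raise ValueError("major 3 carries only two drives (minors 0-127)")
    drive = "ab"[minor // 64]      # 0-63 -> hda, 64-127 -> hdb
    partition = minor % 64         # 0 means the whole disk
    name = "/dev/hd" + drive
    if partition:
        name += str(partition)
    return name

print(decode_ide_root("03:05"))    # prints /dev/hda5
```

So a kernel that never saw the RAMdisk arguments falls back to major 3,
minor 5, which is the fifth partition on the first IDE disk.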

There are various fixes, such as telling syslinux how much memory the
system has, but the permanent fix is to upgrade your BIOS so that you
don't run into this problem again.
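
One way the syslinux workaround might look is sketched below, using the
syslinux.cfg quoted in this thread.  The only change is a "mem=" argument on
the append line, which both syslinux and the Linux kernel recognize.  The
value 128M is a placeholder, not a figure from this thread: substitute the
amount of RAM actually installed in the node.  Whether this particular
setting cures the problem on a given motherboard is an assumption.

```
default beoboot
label   beoboot
  kernel vmlinuz
  append initrd=initrd.img mem=128M apm=off apm=power-off
```

This only works around the bad memory report on media you edit by hand, so
the BIOS upgrade remains the lasting fix.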

> This is the syslinux.cfg on the floppies:
> **BEGIN
> default beoboot
> label   beoboot
>   kernel vmlinuz
>   append initrd=initrd.img apm=off apm=power-off

You have obviously done much debugging on your own!

> This is my (default) /etc/beowulf/fstab:

This isn't used yet.  The /etc/beowulf/fstab file is on the master, and
is used in the /etc/beowulf/node_up script.  The node_up script is
run on the master after the compute node has started accepting commands
from the master (another part of the "single point administration").

> Also, another question, Is there some advantage to using the slave
> nodes' disks? (I mean installing software to them?) 

There can be a significant advantage, depending on the application.
While the base system model is "diskless administration", we strongly
recommend that you have local disks for swap space, scratch files, and
for the ever-improving caching with the later Scyld versions.  In some
cases you will see a noticeable performance advantage by pre-caching
libraries and executables on the compute nodes, although you then run
the risk of version skew (different program versions on the
compute nodes) that doesn't exist with the default system.

I hope this provides some information on the philosophy and architecture
of the Scyld system.

-- 
Donald Becker				becker at scyld.com
Scyld Computing Corporation		http://www.scyld.com
410 Severn Ave. Suite 210		Second Generation Beowulf Clusters
Annapolis MD 21403			410-990-9993



