[Beowulf] Introduction and question

Bill Broadley bill at cse.ucdavis.edu
Thu Feb 28 00:41:57 PST 2019


Yes, you belong!  Welcome to the list.

There are many different ways to run a cluster, but here are my recommendations:

* Make the clusters as identical as possible.

* Set up ansible roles for the head node, NAS, and compute nodes.

* Avoid installing/fixing things by hand with vi/apt-get/dpkg/yum/dnf; use
  ansible whenever possible.  Eventually you'll have to reinstall, and it's
  painful to manually reapply months of changes.

* Use environment modules; never have users manually running "export
  LD_LIBRARY_PATH=..."

* Use slurm partitions to keep significantly different hardware in different
  pools so users have an easy time of knowing what to run where.

* Set ALL compute nodes to netboot, then configure cobbler to tell them to
  boot from local disk normally.  That way you don't have to manually power on,
  wait for the BIOS, and select netboot 30 times to install 30 nodes.

* Enable/configure IPMI at least for power on/off (if available).  Write wrapper
  scripts called pon and poff or similar.

* Keep working until cobbler+ansible can reinstall a compute node unattended:
  it should power off, enable netboot, power on, pxe install, reboot, run
  ansible, enable automount, and run slurmd.  Write wrapper scripts for
  netboot-enable and netboot-disable; I used bon and boff.
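
A minimal sketch of what the pon/poff and bon/boff wrappers could look like, in
Python.  Everything specific here is an assumption for illustration, not part of
the setup described above: the "-ipmi" BMC hostname suffix, the "admin" user,
the password file path, and the function names are all hypothetical.  It relies
on ipmitool's "chassis power" subcommand and on "cobbler system edit" with
--netboot-enabled; check the flag spelling against your cobbler version.  By
default it only prints the commands it would run.

```python
import subprocess

# Assumption: each node's BMC answers at "<node>-ipmi"; adjust to your naming.
BMC_SUFFIX = "-ipmi"
# Assumption: IPMI user and a root-only password file (hypothetical values).
IPMI_USER = "admin"
IPMI_PASSWORD_FILE = "/root/.ipmi_password"


def ipmi_power_cmd(node, action):
    """Build the ipmitool command behind pon/poff ('on', 'off', 'status')."""
    if action not in ("on", "off", "status"):
        raise ValueError(f"unsupported power action: {action}")
    return ["ipmitool", "-I", "lanplus", "-H", node + BMC_SUFFIX,
            "-U", IPMI_USER, "-f", IPMI_PASSWORD_FILE,
            "chassis", "power", action]


def cobbler_netboot_cmd(node, enabled):
    """Build the cobbler command behind bon/boff (netboot enable/disable)."""
    flag = "true" if enabled else "false"
    return ["cobbler", "system", "edit",
            f"--name={node}", f"--netboot-enabled={flag}"]


def reinstall_plan(node):
    """First steps of a reinstall: power off, enable netboot, power on.

    The PXE install, reboot, ansible run, automount, and slurmd start are
    expected to be driven by cobbler and ansible once the node netboots.
    """
    return [ipmi_power_cmd(node, "off"),
            cobbler_netboot_cmd(node, True),
            ipmi_power_cmd(node, "on")]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    lines = []
    for cmd in plan:
        line = " ".join(cmd)
        print(line)
        lines.append(line)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return lines
```

For example, run_plan(reinstall_plan("n01")) prints the three commands for a
hypothetical node n01 without touching it; pass dry_run=False to execute them.
Keeping the command construction separate from execution makes it easy to
review what a reinstall will do before pointing it at 30 nodes.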

The above isn't the only way to do it, but it's a reasonable starting point.
It's really nice for users to just be able to browse apps and say "module load
<app>".  As a SysAdmin it's nice to be able to reinstall any wonky nodes and not
have to play the "what X things do I need to do before it can run jobs" game.

Good luck, have fun, and keep us posted.

