[Beowulf] Most common cluster management software, job schedulers, etc?

Tue Mar 8 14:49:35 PST 2016

On 08/03/16 15:43, Jeff Friedman wrote:

> Hello all. I am just entering the HPC Sales Engineering role, and would
> like to focus my learning on the most relevant stuff. I have searched
> near and far for a current survey of some sort listing the top used
> “stacks”, but cannot seem to find one that is free. I was breaking
> things down similar to this:

All of the following describes just our site, but in your role I would
have thought you'll need to be familiar with most of the options,
depending on customer requirements.  Specialising in your preferred
suite is up to you, of course!

> _OS distro_:  CentOS, Debian, TOSS, etc?  I know some come trimmed down,
> and also include specific HPC libraries, like CNL, CNK, INK?  

RHEL - mainly because of the hardware vendors' support attitude of "we
support both types of Linux, RHEL and SLES".

> _MPI options_: MPICH2, MVAPICH2, Open MPI, Intel MPI, ? 

Open-MPI

> _Provisioning software_: Cobbler, Warewulf, xCAT, Openstack, Platform HPC, ?

xCAT

> _Configuration management_: Warewulf, Puppet, Chef, Ansible, ? 

xCAT

We use Puppet for our infrastructure VMs (running Debian).

> _Resource and job schedulers_: I think these are basically the same
> thing? Torque, Lava, Maui, Moab, SLURM, Grid Engine, Son of Grid Engine,
> Univa, Platform LSF, etc… others?

Yes and no - we run Slurm and use its own scheduling mechanisms but you
could plug in Moab should you wish.

Torque ships with an example pbs_sched, but that's just a FIFO; you'd
want to look at Maui or Moab for more sophisticated scheduling.
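
To illustrate why a plain FIFO is limiting, here is a toy Python sketch
(not pbs_sched's, Maui's or Moab's actual algorithm). It assumes all
jobs are submitted at time 0, and the Job fields, job names and node
counts are made up for illustration. With strict FIFO a job that does
not fit blocks everything behind it; the alternative shown is a naive
"start whatever fits" pass, which real schedulers refine with
reservations and priorities so that large jobs are not starved.

```python
import heapq
from collections import namedtuple

# Hypothetical job description: name, nodes requested, runtime in minutes.
Job = namedtuple("Job", "name nodes runtime")


def simulate(jobs, total_nodes, strict_fifo=True):
    """Return {job name: start time} for jobs all submitted at t=0.

    strict_fifo=True  : the first job that does not fit blocks the queue.
    strict_fifo=False : later jobs may start if they fit right now
                        (naive first-fit, no reservations).
    """
    queue = list(jobs)
    running = []          # min-heap of (end_time, nodes)
    free = total_nodes
    now = 0
    starts = {}
    while queue:
        for job in list(queue):
            if job.nodes <= free:
                queue.remove(job)
                free -= job.nodes
                heapq.heappush(running, (now + job.runtime, job.nodes))
                starts[job.name] = now
            elif strict_fifo:
                break     # head of the queue blocks the jobs behind it
        if queue:
            if not running:
                raise ValueError("a queued job is larger than the cluster")
            # Advance time to the next job completion and free its nodes.
            now, nodes = heapq.heappop(running)
            free += nodes
    return starts


jobs = [Job("A", 2, 10), Job("B", 4, 5), Job("C", 1, 2)]
print(simulate(jobs, total_nodes=4, strict_fifo=True))
print(simulate(jobs, total_nodes=4, strict_fifo=False))
```

On this made-up 4 node cluster the small job C waits behind the full
width job B under strict FIFO, whereas with first-fit it runs
straight away without delaying B.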

> _Shared filesystems_: NFS, pNFS, Lustre, GPFS, PVFS2, GlusterFS, ? 

GPFS here - copes well with lots of small files (looks at one OpenFOAM
project that has over 19 million files & directories - mostly
directories - and sighs).
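
If you want to get a feel for how a project tree breaks down, a minimal
Python sketch like the one below will count files and directories
separately. The function name is just for illustration, and on a tree
with millions of entries a walk like this is slow and hard on the
metadata servers, so treat it as an illustration rather than something
to run casually on a production filesystem.

```python
import os


def count_entries(root):
    """Return (files, directories) found below root, not counting root itself."""
    files = dirs = 0
    # os.walk does not follow symlinked directories by default.
    for _path, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        files += len(filenames)
    return files, dirs


if __name__ == "__main__":
    n_files, n_dirs = count_entries(".")
    print(f"{n_files} files, {n_dirs} directories")
```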

> _Library management_: Lmod, ? 

I've been using environment modules for almost a decade now but our
recent cluster has switched to Lmod.

> _Performance monitoring_: Ganglia, Nagios, ?

We use Icinga for monitoring infrastructure, including polling xCAT and
Slurm for node information such as error LEDs, down nodes, etc.
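
As a rough idea of what polling Slurm from Icinga can look like (this is
a sketch, not our actual check), the Python below parses the output of
"sinfo -h -N -o '%N %T'" and maps it onto the usual Nagios-style exit
codes (0 OK, 2 CRITICAL, 3 UNKNOWN). Which states count as unhealthy is
an assumption you would tune for your own site, and the function names
are made up.

```python
import subprocess

# Assumed set of "unhealthy" state prefixes; matches e.g. "down*",
# "drained", "draining", "failing". Adjust to taste.
BAD_STATES = ("down", "drain", "fail")


def find_bad_nodes(sinfo_output):
    """Parse lines of '<node> <state>' and return sorted unhealthy node names."""
    bad = set()  # a node in several partitions is listed more than once
    for line in sinfo_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        node, state = parts
        if state.lower().startswith(BAD_STATES):
            bad.add(node)
    return sorted(bad)


def main():
    """Run sinfo, print one status line, return a Nagios-style exit code."""
    try:
        out = subprocess.run(
            ["sinfo", "-h", "-N", "-o", "%N %T"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print(f"UNKNOWN - could not run sinfo: {exc}")
        return 3
    bad = find_bad_nodes(out)
    if bad:
        print(f"CRITICAL - {len(bad)} node(s) unhealthy: {', '.join(bad)}")
        return 2
    print("OK - no down or drained nodes")
    return 0

# A real plugin script would finish with: sys.exit(main())
```

Keeping the parsing separate from the sinfo call makes it easy to test
against canned output before pointing it at a live cluster.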

We have pnp4nagios integrated with our Icinga to record time series
information about memory usage, etc.
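
pnp4nagios builds its graphs from the performance data that check
plugins print after a "|" in their output, in the form
label=value[UOM];warn;crit;min;max. A small Python sketch of emitting
that format follows; the metric name and thresholds are invented for
the example, and it assumes labels without spaces (labels containing
spaces need single quotes around them).

```python
def perfdata(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    """Format one metric as Nagios-style performance data.

    Trailing empty fields are dropped, which the format allows.
    """
    fields = [f"{label}={value}{uom}", str(warn), str(crit),
              str(minimum), str(maximum)]
    return ";".join(fields).rstrip(";")


def plugin_line(status, message, *metrics):
    """Build the single output line: human text, then '|' and the metrics."""
    line = f"{status} - {message}"
    if metrics:
        line += " | " + " ".join(metrics)
    return line


# Hypothetical memory check result on a node with 8192 MB of RAM.
mem = perfdata("mem_used", 6144, "MB", warn=7000, crit=7800,
               minimum=0, maximum=8192)
print(plugin_line("OK", "memory usage normal", mem))
```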

> _Cluster management toolkits_: I believe these perform many of the
> functions above, all wrapped up in one tool?  Rocks, Oscar, Scyld, Bright, ?

N/A here.

All the best!
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: samuel at unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/      http://twitter.com/vlsci