[Beowulf] number of admins

Chris Dagdigian dag at sonsorol.org
Wed Jun 8 09:02:29 PDT 2005

My $.02

The number of sysadmins required is a function of how much  
infrastructure you have in place to reduce operational burden:

  - remote power control over all nodes

  - remote access to BIOS on all nodes via serial console

  - remote access to system console via serial port on all nodes

  - unattended/automatic OS installation onto bare metal (AutoYaST,  
Kickstart, SystemImager etc.)

  - unattended/automatic OS incremental updates to running nodes

  - documented plan for handling node hardware failures which  
includes specific info on when and how an admin is expected to spend  
time diagnosing a problem versus when the admin can just hand the  
node off to a vendor or someone else for simple planned replacement  
or advanced troubleshooting.  For Dell systems you want to have an  
agreement in place where your sysadmin can make a judgement call that  
a node needs replacement WITHOUT having to first wade through the  
hell that is Dell's first tier of customer support.
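
To make the last item concrete, here is a minimal sketch of what a written-down triage rule could look like once it is turned into something an admin can apply without debate. Everything in it is an assumption for illustration: the thresholds, the field names and the three outcomes are hypothetical, and a real site would set them from its own support contract.

```python
from dataclasses import dataclass

# Hypothetical limits, not taken from any vendor agreement.
MAX_DIAGNOSIS_MINUTES = 30   # admin time allowed per fault before handing off
MAX_REPEAT_FAILURES = 2      # prior failures after which the node is handed off


@dataclass
class NodeFault:
    node: str
    minutes_spent: int     # admin time already spent on this fault
    repeat_failures: int   # how many times this node has failed before
    cause_known: bool      # True if the failing part has been identified


def triage(fault: NodeFault) -> str:
    """Return 'replace', 'escalate' or 'diagnose' for a faulty node."""
    if fault.cause_known:
        # The admin's judgement call: go straight to planned replacement.
        return "replace"
    if fault.repeat_failures >= MAX_REPEAT_FAILURES:
        # Repeat offender: hand off for advanced troubleshooting.
        return "escalate"
    if fault.minutes_spent >= MAX_DIAGNOSIS_MINUTES:
        # Time budget used up: stop diagnosing and hand the node off.
        return "escalate"
    return "diagnose"
```

The point of the sketch is only that the hand-off decision is mechanical and capped in time, so the admin is not left guessing how long to spend on any one node.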

If you have the infrastructure in place where your admin(s) can do  
everything remotely including OS installs, console access and remote  
power control then you may be able to get away with a single admin  
(as long as his/her job is tightly scoped to keeping the cluster  
functional). If you have not pre-planned your architecture to make  
administration as easy and as "hands off" as possible then you are  
going to need many hands.
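
One way to check where you stand is to treat the list above as a checklist and see what is missing. The sketch below does only that; the capability names are hypothetical labels for the bullets above, and it makes no attempt to compute a headcount.

```python
# Hypothetical labels for the infrastructure items listed above.
REQUIRED = (
    "remote_power",
    "serial_bios",
    "serial_console",
    "unattended_install",
    "unattended_updates",
    "failure_plan",
)


def missing_capabilities(site: dict) -> list:
    """Return the required items that the site does not have in place."""
    return [name for name in REQUIRED if not site.get(name, False)]


site = {"remote_power": True, "serial_console": True, "unattended_install": True}
gaps = missing_capabilities(site)
# An empty list suggests a single tightly scoped admin may be workable;
# anything left in the list is work that will need hands on site.
```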

The biggest reason for cluster deployment unhappiness can be traced  
to this:

  - management and users expect the cluster operators to also be  
experts with HPC programming, the applications in use, application  
integration issues and the cluster scheduler. This almost never works  
out well as the skills and background needed to keep a cluster  
running are often quite different from the expertise needed to  
understand the local research efforts and internal application mix.

This is not a good thing to be doing. The cluster sysadmins should be  
focused on the OS, hardware, interconnects and infrastructure.

You probably need some additional staff resources to specifically cover:

  o Someone who understands the research/work and can talk to end  
users intelligently about how to use/integrate/run/troubleshoot the  
cluster application mix. This person needs to understand the  
science, research and applications involved and probably also needs  
to be a bit of a shell/perl toolsmith who can assist with workflows  
and application integration. This person could actually be recruited  
from the ranks of the users if there is a particular expert "power  
user" who would be interested in the role.

  o Someone who understands high performance scientific software  
development and can help the cluster admins deal with and  
troubleshoot the Myrinet interconnect while also being able to help  
the end users with HPC compiler issues, software dev issues and  
application optimization issues.

So the big message in my mind is:

  o cluster operators should not be expected to be application experts
  o cluster operators should not be expected to be HPC coding &  
scientific software development experts
  o Significant effort needs to be put into training users how to use  
the cluster and the interconnect

Short term you may also need an LSF expert on hand to help get the  
cluster resource policies sorted but that is short term only as the  
cluster admins can pick up the LSF specifics very quickly and easily.


On Jun 6, 2005, at 6:23 PM, David Kewley wrote:

> Hi all,
> We expect to get a large new cluster here, and I'd like to draw on the
> expertise on this list to educate management about the personnel
> needed.
> The cluster is expected to be:
> ~1000 Dell PE1850 dual CPU compute nodes
> master & other auxiliary nodes on similar hardware
> 1024-port Myrinet
> Nortel stacked-switches-based GigE network
> many-TB SAN built on Data Direct & Ibrix
> Platform Rocks
> Platform LSF HPC Rocks roll
> Moab added later, quite possibly
> tape library backup (software TBD)
> NFS service to public workstations
> nine man-weeks of Dell installation support
> 10 man-days of Ibrix installation support
> The users will be something like:
> ~10 local academic groups, perhaps 60 users total
> several different locally-written or -customized codebases
> at least one near-real-time application with public exposure
> We have some experience already with a 160-node Dell cluster that has
> some of the basic elements listed above, but several of the pieces will
> be totally new, and some of the pieces we already have will need
> greater care.
> My questions to you are:
> * How many sysadmins should we plan to have once the cluster is stable?
> * Is there indeed any such thing as a "stable" cluster of this sort, and
> if so, should we get additional help during the initial phase of the
> project, when things are less stable (help beyond the vendor
> installation support listed above)?
> * If we need more help in the initial phases, how might we go about
> finding people?  Contract workers?  Commercial or private
> consultancies?
> * Should we look for any specific non-obvious skillset, or would skilled
> sysadmins be adequate?
> And finally:
> * If we only have one sysadmin, someone who is bright and capable, but
> is learning as they go, is that too small a support staff?
> * If one such sysadmin is too little, then what would you expect the
> impact on the users to be?
> I have been giving my opinion to management, but I'd really like to get
> (relatively unbiased) professional opinions from outside as well.  I
> thank you for any comments you can make!
> David
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
