[Beowulf] number of admins

Sean Dilda agrajag at dragaera.net
Wed Jun 8 10:01:06 PDT 2005

David Kewley wrote:

> My questions to you are:
> * How many sysadmins should we plan to have once the cluster is stable?
> * If we only have one sysadmin, someone who is bright and capable, but 
> is learning as they go, is that too small a support staff?
> * If one such sysadmin is too little, then what would you expect the 
> impact on the users to be?

The answer to these questions depends on what kind of non-sysadmin support 
staff will be around.  I personally am the sole sysadmin for a ~300 node 
cluster with 7 research groups and over 100 users.  Under current 
conditions, I can probably scale up towards 1000 nodes without problem.

With that said, I have a lot of non-sysadmin support to help me out. 
There's a guy who does a lot of scientific computing support.  He helps 
researchers write/optimize code, and also helps them out with issues 
with Fortran compilers, etc.  I also have a 24-hour Operations staff I 
can rely on.  They take care of the server room.  If I ever need a node 
rebooted or anything like that, I can give them a call and they'll take 
care of it.  When a piece of hardware breaks, I pass the information on 
to them so they can sit on hold with Dell, handle 
shipping/receiving/etc, and all I have to do is turn the box off and 
switch out the part.  Without these people to help me, I'd probably be 
at my limit now.  But because I have them to help, I can handle a bit more.

I don't think my man hours are stretched by the number of nodes as much 
as by the number of user requests and the number of different hardware 
models in the cluster.  Those are the things that can really eat up time.

As for specific skills to look for, I recommend someone who knows the 
Linux distro well and is familiar with maintaining a large number of 
identical machines (clustered or not).  We use CentOS (a Red Hat 
Enterprise Linux clone) on the cluster.  By using tools like yum and 
kickstart I've been able to minimize the amount of work required to keep 
up with hundreds of OS images.  These same technologies are regularly 
used in computing labs, large web server farms, etc.  While I came into 
this job already familiar with Beowulf technology, I'm not an MPI expert, 
and I hadn't even used SGE (our scheduler of choice) before I got here. 
I was able to pick up all the SGE knowledge I needed on the job. 
What I think really made it work for me was already knowing yum and 
kickstart well, which left me time to learn SGE and handle user 
requests.
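
To illustrate the idea of keeping many identical machines in step, here 
is a minimal sketch that renders one kickstart file per node from a 
single template. Everything in it is an assumption made for illustration: 
the node naming scheme, the mirror URL, the package list, and the 
function names are hypothetical, and this is not the configuration used 
on the cluster described above. The template is also incomplete (it 
omits rootpw, for example), so treat it as a starting point only.

```python
from string import Template

# Hypothetical kickstart template. The directive values are placeholders
# chosen for illustration, not settings from any real cluster.
KICKSTART_TEMPLATE = Template("""\
install
url --url=$mirror
lang en_US.UTF-8
keyboard us
network --bootproto=dhcp --hostname=$hostname
timezone UTC
bootloader --location=mbr
clearpart --all --initlabel
autopart
reboot

%packages
@base
$extra_packages

%post
yum -y update
""")


def node_names(prefix, count):
    """Generate zero-padded node names such as node001, node002, ..."""
    return [f"{prefix}{i:03d}" for i in range(1, count + 1)]


def render_kickstart(hostname, mirror, extra_packages=()):
    """Return the kickstart text for a single node."""
    return KICKSTART_TEMPLATE.substitute(
        hostname=hostname,
        mirror=mirror,
        extra_packages="\n".join(extra_packages),
    )


if __name__ == "__main__":
    # Assumed mirror URL and package list, for demonstration only.
    mirror = "http://mirror.example.org/centos/os"
    for name in node_names("node", 3):
        print(f"--- {name}.ks ---")
        print(render_kickstart(name, mirror, ["gcc", "make"]))
```

Because every node is rendered from the same template, a change to the 
package list or the %post section is made once and applies to every 
node the next time it is reinstalled.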

I hope this helps.

