[Beowulf] number of admins
agrajag at dragaera.net
Wed Jun 8 10:01:06 PDT 2005
David Kewley wrote:
> My questions to you are:
> * How many sysadmins should we plan to have once the cluster is stable?
> * If we only have one sysadmin, someone who is bright and capable, but
> is learning as they go, is that too small a support staff?
> * If one such sysadmin is too little, then what would you expect the
> impact on the users to be?
The answer to these questions depends on what kind of non-sysadmin support
staff will be around. I personally am the sole sysadmin for a ~300 node
cluster with 7 research groups and over 100 users. Under current
conditions, I can probably scale up towards 1000 nodes without problem.
With that said, I have a lot of non-sysadmin support to help me out.
There's a guy who does a lot of scientific computing support. He helps
researchers write/optimize code, and also helps them out with issues
with Fortran compilers, etc. I also have a 24-hour Operations staff I
can rely on. They take care of the server room. If I ever need a node
rebooted or anything like that, I can give them a call and they'll take
care of it. When a piece of hardware breaks, I pass the information on
to them so they can sit on hold with Dell, handle
shipping/receiving/etc, and all I have to do is turn the box off and
switch out the part. Without these people to help me, I'd probably be
at my limit now. But because I have them to help, I can handle a bit more.
I don't think my man hours are stretched by the number of nodes as much
as by the number of user requests and the number of different hardware
models in the cluster. Those are the things that can really eat up time.
As for specific skills to look for, I recommend someone who knows the
Linux distro well and is familiar with maintaining a large number of
identical machines (clustered or not). We use CentOS (a Red Hat
Enterprise Linux clone) on the cluster. By using tools like yum and
kickstart I've been able to minimize the amount of work required to keep
up with hundreds of OS images. These same technologies are regularly
used in computing labs, large web server farms, etc. While I came into
this job already familiar with Beowulf technology, I'm no MPI expert,
and I hadn't even used SGE (our scheduler of choice) before I got here.
I was able to pick up all the SGE knowledge I needed while I was here.
The thing that I think really made it work well for me was already
knowing yum and kickstart, so that I had time to learn SGE and handle
user requests.
I hope this helps.