[Beowulf] number of admins
Chris Dagdigian dag at sonsorol.org
Wed Jun 8 09:02:29 PDT 2005
- Previous message: [Beowulf] number of admins
- Next message: [Beowulf] number of admins
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
My $.02

The number of sysadmins required is a function of how much infrastructure you have in place to reduce operational burden:

- remote power control over all nodes
- remote access to BIOS on all nodes via serial console
- remote access to system console via serial port on all nodes
- unattended/automatic OS installation onto bare metal (AutoYaST, kickstart, SystemImager etc.)
- unattended/automatic OS incremental updates to running nodes
- documented plan for handling node hardware failures which includes specific info on when and how an admin is expected to spend time diagnosing a problem versus when the admin can just hand the node off to a vendor or someone else for simple planned replacement or advanced troubleshooting. For Dell systems you want to have an agreement in place where your sysadmin can make a judgement call that a node needs replacement WITHOUT having to first wade through the hell that is Dell's first tier of customer support.

If you have the infrastructure in place where your admin(s) can do everything remotely including OS installs, console access and remote power control then you may be able to get away with a single admin (as long as his/her job is tightly scoped to keeping the cluster functional).

If you have not pre-planned your architecture to make administration as easy and as "hands off" as possible then you are going to need many hands.

The biggest reason for cluster deployment unhappiness can be traced to this:

- management and users expect the cluster operators to also be experts with HPC programming, the applications in use, application integration issues and the cluster scheduler.

This almost never works out well as the skills and background needed to keep a cluster running are often quite different from the expertise needed to understand the local research efforts and internal application mix. This is not a good thing to be doing. The cluster sysadmins should be focused on the OS, hardware, interconnects and infrastructure.
You probably need some additional staff resources to specifically cover:

o Someone who understands the research/work and can talk to end users intelligently about how to use/integrate/run/troubleshoot the cluster application mix. This person needs to understand the science, research and applications involved and probably also needs to be a bit of a shell/perl toolsmith who can assist with workflows and application integration. This person could actually be recruited from the ranks of the users if there is a particular expert "power user" who would be interested in the role.

o Someone who understands high performance scientific software development who can help the cluster admins deal with and troubleshoot the Myrinet interconnect while also being able to help the end users with HPC compiler issues, software dev issues and application optimization issues.

So the big message in my mind is:

o cluster operators should not be expected to be application experts
o cluster operators should not be expected to be HPC coding & scientific software development experts
o Significant effort needs to be put into training users how to use the cluster and the interconnect

Short term you may also need an LSF expert on hand to help get the cluster resource policies sorted but that is short term only as the cluster admins can pick up the LSF specifics very quickly and easily.

-Chris
bioteam.net

On Jun 6, 2005, at 6:23 PM, David Kewley wrote:

> Hi all,
>
> We expect to get a large new cluster here, and I'd like to draw on the
> expertise on this list to educate management about the personnel
> needed.
>
> The cluster is expected to be:
>
> ~1000 Dell PE1850 dual CPU compute nodes
> master & other auxiliary nodes on similar hardware
> 1024-port Myrinet
> Nortel stacked-switches-based GigE network
> many-TB SAN built on Data Direct & Ibrix
> Platform Rocks
> Platform LSF HPC Rocks roll
> Moab added later, quite possibly
> tape library backup (software TBD)
> NFS service to public workstations
> nine man-weeks of Dell installation support
> 10 man-days of Ibrix installation support
>
> The users will be something like:
>
> ~10 local academic groups, perhaps 60 users total
> several different locally-written or -customized codebases
> at least one near-real-time application with public exposure
>
> We have some experience already with a 160-node Dell cluster that has
> some of the basic elements listed above, but several of the pieces will
> be totally new, and some of the pieces we already have will need
> greater care.
>
> My questions to you are:
>
> * How many sysadmins should we plan to have once the cluster is
>   stable?
> * Is there indeed any such thing as a "stable" cluster of this sort, and
>   if so, should we get additional help during the initial phase of the
>   project, when things are less stable (help beyond the vendor
>   installation support listed above)?
> * If we need more help in the initial phases, how might we go about
>   finding people? Contract workers? Commercial or private
>   consultancies?
> * Should we look for any specific non-obvious skillset, or would skilled
>   sysadmins be adequate?
>
> And finally:
>
> * If we only have one sysadmin, someone who is bright and capable, but
>   is learning as they go, is that too small a support staff?
> * If one such sysadmin is too little, then what would you expect the
>   impact on the users to be?
>
> I have been giving my opinion to management, but I'd really like to get
> (relatively unbiased) professional opinions from outside as well. I
> thank you for any comments you can make!
>
> David
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
