[Beowulf] number of admins
Bill Rankin wrankin at ee.duke.edu
Wed Jun 8 11:26:15 PDT 2005
Chris Dagdigian wrote:
> My $.02

Actually the advice given is worth more than that - it's pretty much
right on target. I have a couple of additions.

> The number of sysadmins required is a function of how much
> infrastructure you have in place to reduce operational burden:
>
> - remote power control over all nodes
>
> - remote access to BIOS on all nodes via serial console
>
> - remote access to system console via serial port on all nodes

Alternately, the above "good ideas" can be met by 1) a crash cart with
a display/keyboard/mouse, and 2) a 24/7 ops staff.

A quick word on facilities - a 1000 node cluster is quite an
installation and a facilities nightmare. You are looking at 100 tons of
AC and 400kVA of power (at least). Not to mention UPS and generator
backup.

Get outside consultants to come in and assess your needs. Make sure
that they know clusters. I have met many people who have years of
machine room operational experience, but still have difficulty wrapping
their heads around the concept of a single *rack* of equipment that can
radiate 14kW+ of heat. Much less 25 or so of these racks packed
shoulder to shoulder.

If you do not have an around-the-clock on-site staff, then you will
need to sit down and carefully run through a couple of scenarios -

1) it's 3am on a Saturday night and you lose one of your coolers while
the cluster is at full load. You have about 15-30 minutes before you
start seeing hardware melting - who shows up, how long does it take
them and what do they do?

2) same day, same time - you have a water pipe burst. same questions.

I am not sure of how your organization is structured, but I would
highly recommend meeting with the groups that run the major campus
computing infrastructure - the folks who do 24/7 support, the ones who
run the *big* machine rooms. Talk to them and bring them into the
process early. Get their advice.

> - unattended/automatic OS installation onto bare metal (AutoYaST,
>   kickstart, SystemImager etc.)
>
> - unattended/automatic OS incremental updates to running nodes

Absolutely required - here at Duke we use pxe/kickstart/yum to auto
install and maintain patch levels on all nodes.

Note that the success of your cluster will depend on the local/campus
Linux infrastructure - OS repositories, application repositories, local
knowledge. If you do not have this readily available, then you will
have to build it. Do not rely upon the availability of outside Linux
resources that you don't have at least some influence with.

> - documented plan for handling node hardware failures which includes
> specific info on when and how an admin is expected to spend time
> diagnosing a problem versus when the admin can just hand the node off
> to a vendor or someone else for simple planned replacement or advanced
> troubleshooting. For Dell systems you want to have an agreement in
> place where your sysadmin can make a judgement call that a node needs
> replacement WITHOUT having to first wade through the hell that is
> Dell's first tier of customer support.

I will second this recommendation. Dell customer support means well,
but they are trained to deal more with Mom&Dad's PC rather than a room
full of servers. "Reboot the machine with the diagnostics disk" is not
an option when it's a fileserver and one disk in a RAID has clearly
died.

Maintain a full spares kit on-site. The kit should include at least a
hard drive, a replacement power supply, a replacement network switch,
and a full set of spares for your Myrinet or InfiniBand network. Don't
forget the cables.

> If you have the infrastructure in place where your admin(s) can do
> everything remotely including OS installs, console access and remote
> power control then you may be able to get away with a single admin (as
> long as his/her job is tightly scoped to keeping the cluster
> functional). If you have not pre-planned your architecture to make
> administration as easy and as "hands off" as possible then you are
> going to need many hands.

Even if you have, I would plan on at least two really *good* sys admins
to manage the cluster. I would add a third to manage the storage and
backups.

> The biggest reason for cluster deployment unhappiness can be traced to
> this:
>
> - management and users expect the cluster operators to also be
> experts with HPC programming, the applications in use, application
> integration issues and the cluster scheduler. This almost never works
> out well as the skills and background needed to keep a cluster running
> are often quite different from the expertise needed to understand the
> local research efforts and internal application mix.

Do not underestimate this. You should have at least one full time HPC
applications person, probably two. Often this is a post-doc level
position.

As you build your cluster, it is vital that you build up your personnel
to match.

Hope this helps. Good luck!

-bill

--
bill rankin, ph.d. ........ director, cluster and grid technology group
wrankin at ee.duke.edu .......................... center for computational
duke university ...................... science engineering and medicine
http://www.ee.duke.edu/~wrankin .............. http://www.csem.duke.edu
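
For anyone who wants to sanity-check the facilities figures in the message (25 or so racks at 14kW each, 100 tons of AC, 400kVA), here is a rough back-of-envelope sketch. The function names are made up for illustration, and two inputs are assumptions rather than numbers from the post: that all electrical power drawn ends up as heat, and a power factor of 0.9. The conversion of 1 ton of refrigeration to about 3.517 kW is the standard definition (12,000 BTU/h).

```python
# Back-of-envelope facilities check for the rack figures quoted above.
# Assumptions (not from the original post): every rack runs at 14 kW,
# all electrical power ends up as heat, and the power factor is 0.9.

KW_PER_TON = 3.517  # 1 ton of refrigeration = 12,000 BTU/h, about 3.517 kW


def heat_load_kw(racks: int, kw_per_rack: float) -> float:
    """Total heat load in kW for a room of identical racks."""
    return racks * kw_per_rack


def cooling_tons(load_kw: float) -> float:
    """Tons of refrigeration needed to remove load_kw of heat."""
    return load_kw / KW_PER_TON


def apparent_power_kva(load_kw: float, power_factor: float) -> float:
    """Apparent power in kVA for a real load in kW at a given power factor."""
    if not 0 < power_factor <= 1:
        raise ValueError("power factor must be in (0, 1]")
    return load_kw / power_factor


if __name__ == "__main__":
    load = heat_load_kw(racks=25, kw_per_rack=14.0)
    print(f"heat load: {load:.0f} kW")
    print(f"cooling:   {cooling_tons(load):.1f} tons")
    print(f"power:     {apparent_power_kva(load, 0.9):.0f} kVA (assumed pf 0.9)")
```

Under those assumptions this works out to 350 kW of heat, a little under 100 tons of cooling, and roughly 389 kVA, which is in line with the "at least" figures in the message. It is a floor rather than a design number: it leaves out the power drawn by the cooling plant itself, storage, network gear, and any headroom.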
