[Beowulf] Setting up a new Beowulf cluster
Robert G. Brown rgb at phy.duke.edu
Wed Feb 13 07:45:59 PST 2008
On Fri, 8 Feb 2008, Berkley Starks wrote:

> Thank you all so much for the advice so far. This has helped me see a few
> more of the things that I did not realize at first.
>
> For a little info on the project, I developed this project as a tool to work
> on my Senior Thesis in a year or so. Doing computational nuclear physics
> requires such resources. It will also be used heavily for Monte Carlo
> simulations and just about any other form of computational physics. The two
> named are definite projects that are already in the lineup for when I do
> get the cluster up and functional.

(Sorry about the delay, I'm busy busy busy:-)

OK, this (and the stuff below) makes your job relatively easy. I'm going to guess that your application mix will almost certainly be "embarrassingly parallel", at least at first -- lots of compute nodes running MC simulations in nuclear physics (a situation we also have here at Duke) plus people running random applications of one sort or another in a sort of "compute farm" way. After you've had it for a few years, you'll probably start to develop at least a few "real parallel" applications, so we'll use a design that can segue into that; but to do that "right" you'll have to deliberately engineer the cluster to fit the task, and you will need an actual budget.

You'll need an actual budget to get started here, too, especially if you want to build a cluster that is actually "useful". Here's the math. According to Moore's Law (a scaling law for computing performance at constant cost that has held at least approximately for something like four decades), compute power at constant cost has doubled roughly every 18 months. That means that four year old machines, by the time you get them, will be 2^(48/18), or roughly 6-8, times slower than a brand new machine that costs just as much as they cost when they were new. Since amazingly powerful machines -- dual processor, dual core, 64 bit CPU machines -- can be purchased for (say) $2000, give or take a bit depending on what precisely you get on them, and might be MORE than 8x faster than an old 32-bit P6 machine, you're going to have the paradox that some faculty desktops will be faster than your entire cluster. To put it another way: while using old machines is fine for making a learning cluster, it's going to suck in production, with a lot of work and investment required to get to where you could go far more easily by buying a single new desktop at modest cost.
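If you want to play with that arithmetic yourself, here's a two-minute back-of-the-envelope sketch in Python (the 18-month doubling period is the usual rule of thumb, not gospel, and the ages are just examples):

    # Rough Moore's Law depreciation: how much slower is an old box than a
    # new one bought at the same price? Assumes the rule-of-thumb doubling
    # period of 18 months -- an assumption, YMMV.
    def slowdown(age_years, doubling_months=18.0):
        return 2 ** (age_years * 12.0 / doubling_months)

    for age in (2, 3, 4, 5):
        print("%d-year-old box: ~%.1fx slower than new" % (age, slowdown(age)))
    # A 4-year-old box comes out ~6.3x slower; call it 6-8x by the time
    # surplused machines actually land in your rack.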
The design I'm going to suggest for you (Geeze, I feel like Clinton on What Not to Wear) is a tasteful cluster: one that is initially surprisingly affordable, gives you the opportunity to learn about clustering, and provides your nuclear group with "a" place to run jobs, yet can grow and change as your needs (and budget!) grow and change. Let's budget it out.

Your cluster will need a home, and there are good homes and not so good homes depending on its scale. Close to networking is good. In a rack is great, although you can certainly get started on heavy duty steel shelving. A floor that is rated to support the weight of your growing stack of hardware is key -- a fully loaded rack can be quite heavy, and nothing ruins your day like having a rackful of expensive hardware crash through the floor to land on the head of somebody one floor down (or worse, break all the way through in a cascade effect down to the basement). Hard on heads, and likely to break all that expensive equipment. Oh, and the building. Did I mention the lawsuits?

The three critical components required by your cluster in its physical home are power, cooling, and a network (or network access).

A "box" -- a cluster node containing one or more processors -- typically draws between 100W and 250W or so, depending on how many processors it has, how much memory, and whether or not it has disk(s) or other peripherals. This is a rule of thumb, YMMV. At one point I would have estimated 100W per CPU, but nowadays I think it is probably down to more like 50-60W per CPU core (anybody have current numbers on actual hardware to contribute?). If we assume that you'll get started with a humble 8 contributed ancient P4's at 125W each, that's a kilowatt right there. Add networking, add disks, add monitor(s), add a separate server, and you're right up at the limit of a standard 20 ampere circuit. This means that you will need AT LEAST one dedicated 20 Amp 120VAC circuit to run your cluster, and will need ADDITIONAL circuits as your cluster grows. They don't all have to be handy when you get started, but if you try to put the cluster onto an existing, already half-loaded circuit, it's going to trip breakers when you first power it on, and that's embarrassing, so think ahead.

As a physics student, you will recall thermodynamics. All the power consumed by cluster nodes appears shortly thereafter in their immediate environment as heat. If you remove the heat as fast as it is generated, the environment (and nodes) remain at a constant temperature. If not, it gets hotter until thermal diffusion through e.g. the walls of the space balances it out. Computers HATE to be hot. They express their irritation by breaking, burning out early, and malfunctioning outright -- throwing bit errors that ruin a computation. We want our cluster to be cool and happy, to last a long time, and to run reliably, so we want our cluster space to be anywhere from cool to COLD. The rule of thumb is that computer components lose a year of expected lifetime for every 10 degrees Fahrenheit above an ambient air temperature of 68F (20C), which is a "cool" temperature for an office. 60F is better still -- most server rooms are maintained with ambient air anywhere from as cool as 50F (10C) to a more likely ballpark of 60F under load.

Air conditioning capacity is measured in "tons", where a ton of AC is a unit capable of removing, over 24 hours, the heat required to transform a ton-sized block of ice into water at 32F (latent heat of fusion, work it out), which just happens to be ~3500 watts. You want to be able to stay AHEAD of the heat and also cool heat infiltration from outside, so you'll need more (25-35% more) AC capacity in watts than you have power capacity in watts. You also need to worry about air circulation, especially if you're building the cluster in e.g. a closet (NOT recommended). A big open room gets a bit of convective help and is better than a small closed space. The air should be and remain dry.
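To make the circuit and AC arithmetic concrete, here's a hedged back-of-the-envelope in the same spirit (every wattage in it is one of the rule-of-thumb guesses above, not a measurement):

    # Back-of-the-envelope power/cooling budget for the starter cluster.
    NODES = 8
    WATTS_PER_NODE = 125.0    # ancient P4 guess from above, YMMV
    OVERHEAD_WATTS = 400.0    # switch, disks, monitor, server -- a guess
    VOLTS = 120.0
    WATTS_PER_TON = 3500.0    # one "ton" of AC removes ~3500 W of heat
    AC_HEADROOM = 1.3         # the 25-35% extra capacity margin

    load_watts = NODES * WATTS_PER_NODE + OVERHEAD_WATTS
    amps = load_watts / VOLTS
    tons = AC_HEADROOM * load_watts / WATTS_PER_TON
    print("Load: %.0f W = %.1f A at %.0f VAC" % (load_watts, amps, VOLTS))
    print("AC needed: ~%.2f tons" % tons)
    # ~1400 W is ~11.7 A -- most of a dedicated 20 A circuit once you
    # derate to 80% for continuous load -- and roughly half a ton of AC.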
Then the space needs networking. There are two aspects of this to consider, and they're not separate for the cluster design I'm going to suggest. One is the network required by the actual cluster nodes, which communicate with each other (if needed) and with the "master" node and other workstations in the department (certainly) via network interconnects. The other is the network connection to the rest of the department -- how are people going to use the cluster? It is by far easiest if they can just start jobs up from their desktops, which means everything needs to be on the same network.

Minimally, then, the cluster space needs >>a<< network wire running into it from the building networking closet, connected to that closet's presumed switch. Beyond that, there are several ways to proceed, depending on local politics, who provides what, who "owns and runs" what, and practical considerations.

One scenario is that you upgrade the existing building networking closet by adding a 48 port professional-grade gigabit ethernet switch that is uplinked into the existing, possibly slower, department switch. A nice fat bundle of cable is run from the punchblocks in this closet back to your cluster space and punched into a panel of RJ45 ports in a rack there. As you add nodes, you simply cable them into this panel and add cables in the wiring closet from the punch port back to the switch. This has many advantages -- one being that you can hook (selected) faculty or office DESKTOPS into the gigabit switch so they are on the same flat network, effectively INCLUDING THEM IN THE CLUSTER. Since some of your faculty -- the ones doing the MC computations, for example -- will have power desktops that might equal or exceed the power of your initial cluster, this gives EVERYBODY potential access to all of that power if you establish a resource-sharing policy, and it can quite easily make your initial cluster 3-4x as powerful as it might otherwise be, especially if you can salvage spare cycles on e.g. student machines that are idle all night.

Another scenario is that you get a single smaller gigabit switch for your cluster, mount it in the rack or on the shelf with the nodes, and have a single gigabit link back to the department network. This gives you a bottleneck between the faculty desktops and the cluster, but for embarrassingly parallel code it won't matter. I'm guessing this is the way you will go initially, and you can always change over later; but SOMETIMES, if you dicker things like the former out now, you can get other people to pay for them and end up with something really nice and scalable for the future, or at least grease the way for later when you need to go back and say you've outgrown the first effort and need to reconsider. IF you ever get to where a higher end network is necessary -- a "real" dedicated cluster network -- you'll probably need to use the local switch architecture anyway, although you might well run both that network and whatever switched gigE/TCP/IP network you started with at the same time.
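Before committing to the single-uplink scenario, it's worth sanity-checking that a lone gigabit link really won't bottleneck embarrassingly parallel jobs. A sketch with made-up job numbers (substitute your own code's I/O profile):

    # Does one gigE uplink bottleneck N embarrassingly parallel jobs?
    N_JOBS = 24                    # concurrent MC jobs (illustrative)
    MB_PER_RESULT = 50.0           # data each job ships home (a guess)
    INTERVAL_S = 600.0             # ...every ten minutes (a guess)
    LINK_MB_PER_S = 1000.0 / 8.0   # gigabit ethernet, ~125 MB/s raw

    demand = N_JOBS * MB_PER_RESULT / INTERVAL_S
    print("Demand: %.1f MB/s of %.0f MB/s (%.0f%% of the link)"
          % (demand, LINK_MB_PER_S, 100.0 * demand / LINK_MB_PER_S))
    # ~2 MB/s out of ~125 MB/s -- nowhere near saturation, which is why
    # the bottleneck "won't matter" for this kind of code.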
Anyway, enough on infrastructure. Let's talk about the cluster itself and what you'll need to acquire or budget. I truly think that you're going to need a budget of a few thousand dollars even to get started, although if you can't get even that small an amount, well, we'll do what we can.

Cheapest Possible 2-8 Box Learning Cluster

Ingredients:

* Heavy duty steel shelving ($50 at Home Depot).

* 8-10 port gigabit switch ($50 from numerous makers and vendors).

* 16 6' to 14' patch cables, cable ties, 2 surge protector power strips, a small work table/bench, and a work chair -- scrounged if possible; $250 would buy it all and a nice little pocket toolkit as well.

* A monitor and keyboard (and possibly a mouse) on the workbench, connectable to the back of each node on demand. Scrounging is OK, but you can get a nice flat panel that draws less power (and makes less heat) and is a lot easier to move around for $200-250, or a whole KVM setup for easily less than $300. You may want to consider a small KVM switch to make it "easy" to switch between consoles on nodes, but this is a luxury item and really belongs in the next description instead.

For nodes you take what you can scrounge and augment them by buying what you can afford. You should be prepared to repair nodes, buy them gigabit ethernet cards, and add memory or a disk, either at cost or from a "boneyard" of scavenged parts from systems that are DOA but still have usable memory chips or CPUs or power supplies. Still, I'm guessing you'll need a few hundred dollars absolute minimum in a budget to get started. Your "free" nodes will only rarely turn out to really be free; more often you'll have to drop maybe $50 into them to add memory and networking (again, this cost plus the differential cost of power alone favors BUYING brand new nodes over fixing up old ones -- THERE IS NO PRICE-PERFORMANCE WIN in going cheap, for all that it is very informative and a great learning experience).

One node you will almost certainly want to buy, or build out of the best of what you can scrounge: your cluster's "head", or "server", node. I'm going to suggest a flat cluster design, so the latter is the more reasonable description. This is a machine you fix up or purchase with:

* lots of memory, 1-2 GB if possible.
* multiple CPUs or CPU cores, 2-4 if possible.
* a "good" (e.g. Intel) gigabit ethernet interface, or even two.
* 3-4 largish disks, configured in an md RAID level 5.
* a "good" graphics adapter -- one capable of running a graphical display efficiently and at a decent resolution (which should of course match up decently with the capability of your monitor, which I suggest be capable of at least 1280x1024 at 17" or more diagonal).

This machine is the one that you set up with a full linux desktop and an NFS-exportable filesystem for /home and/or workspace on all of the nodes. It MAY end up being a DHCP/PXE server (which may require that it be on a private network, in order not to fight with departmental servers, which in turn may require that second ethernet interface), a web server (to facilitate HTTP-driven PXE installs), or a diskless node server (if you go with a diskless node design to save money and power at the expense of a somewhat steeper initial learning curve). In master-slave computations it will likely be the master. In computations run in "batches" it will be the place those jobs are submitted, and the place users visit to retrieve results. It will (usually) be the node you "name" for the cluster, with the nodes themselves given abbreviated hostnames like b01, b02, b03...

I would budget a MINIMUM of $1500 for this node, purchased new; $2000 would be better. If you rebuild it out of parts, you'll need to scrounge an old system with a tower big enough to hold 3-4 disks (a mid-size tower will usually be a bit tight), with as fast a CPU as you can manage, as much memory as you can afford to add, and 1-2 gigE interfaces. I am not including backup devices in this cluster design -- too expensive.

This gives you, tallying things up, the need for an absolute minimum budget of $1000-$1500, which presumes that you scrounge nearly everything but still need to buy disks, memory, spare parts, and the network switch, with a bit left over to handle server crashes and make life comfortable.
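For the server roles just described, the configuration is mercifully short. A minimal sketch, assuming a private 192.168.1.0/24 cluster network with the head node at 192.168.1.1 (all names and addresses are hypothetical, adapt to taste):

    # /etc/exports on the head node -- export /home to the nodes
    /home    192.168.1.0/255.255.255.0(rw,sync,no_root_squash)

    # dhcpd.conf fragment for PXE-booting nodes on the private net
    subnet 192.168.1.0 netmask 255.255.255.0 {
        range 192.168.1.100 192.168.1.200;
        option routers 192.168.1.1;
        next-server 192.168.1.1;     # TFTP server holding boot images
        filename "pxelinux.0";       # the syslinux PXE loader
    }

Run exportfs -ra after editing /etc/exports, and the nodes (b01, b02, ...) mount their home directories from the head node at boot.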
You'd do far better with a budget of $3500, buying yourself a nice server/head node, setting up a nice working environment and a much larger network switch from the beginning, and still having $1000 or so left to fix up scrounged nodes.

NOTE WELL! As noted above, if your cluster is "flat" with the department (linux) network, you can easily enough make your scheduler distribute jobs out to individual (linux) desktops and include them "in" your cluster, using e.g. Condor as a resource manager. In fact, you can make a "cluster" out of your existing linux LAN at no investment but time and software configuration, IF your department policy and so on permit it. It often depends on who "owns" those desktops and what they get out of it -- linux is perfectly capable of running a desktop interactive session with somebody AND a background numerical task, with essentially no impact of the latter on the former; desktop computing rarely uses as much as 1% of a system's total compute capacity.
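To make the Condor suggestion concrete: part of why it's such a cheap win is how little users have to write to scatter a batch of independent MC runs across every idle machine. A minimal submit description file looks something like this (the executable and file names are hypothetical):

    # mc.submit -- 100 independent Monte Carlo runs under Condor
    universe    = vanilla
    executable  = mc_run
    arguments   = --seed $(Process)
    output      = mc_$(Process).out
    error       = mc_$(Process).err
    log         = mc.log
    queue 100

A condor_submit mc.submit then queues 100 jobs, and Condor farms them out to whatever desktops and nodes are idle, preempting them when an owner sits back down at the console.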
On to what I think of as a "better idea":

Inexpensive Starter Cluster with a Future

The good thing about the cluster above is that it is cheap. Oh, you can go even cheaper. Take two systems, slap linux on them, pop them on any old network, and it is "a cluster" in that you can run computations on both at the same time and add more nodes when you find them. Or just look at your department linux LAN, enable logins for all users on all desktops, establish a policy for use or install a policy tool like Condor, and go "poof! you're a cluster!". That's a description of my own home cluster -- a flat switched network with lots of linux boxes that are "a cluster" when I want them to be and desktops the rest of the time, where I don't even bother with Condor (ownership being clear, policy unneeded).

However, the BAD thing about it is that, cheap as it is, anything built with 4 year old hardware is a loser right out of the box. Seriously. The differential cost of POWER ALONE over a single year will generally buy you a single modern system that is as fast or faster than the entire cluster. That's the bitchin' thing about Moore's Law -- there is no sane afterlife for systems, because it gets to where the cost of operation alone exceeds the cost of replacement, and then we can do all sorts of TCO computations and assessments of the cost of maintenance and conclude that it is really, really dumb to do this UNLESS other people will pay for power no matter how much you use but won't give you the money to buy nodes. Which happens so often that it isn't funny, but it is stupid nevertheless. Or for student/learning clusters, where you do what you can and have NO budget but what you can raise at a bake sale. I advise 2-3 students a year in that category, so I'm pretty sympathetic to it, but I advise them to come up with a few-thou-a-year budget nevertheless.

So here's the "better" design. It costs more initially, but it will scale nicely out to racks and racks of systems, and the systems you get will always be boxes that your nuclear faculty will drool over and WANT to run their jobs on -- so much so that initially they'll fight to get time, and be properly motivated to write grants with an equipment budget that contributes a few nodes a year or more to your collection.

Start by buying a nice, 43U, four post, open equipment rack. IIRC you can get one for around $400 that will work just fine (don't get the $1000+ ones with glass doors and whatever -- you're not made of money, after all). Get a nice 48 port professional-grade rackmount gigabit ethernet switch for maybe $800. Get a few packets of ethernet cables in different colors and in lengths from 6' to 14', velcro cable ties, maybe a rackmount power distribution system (not necessarily a "UPS", mind you), and cable holders -- enough stuff to outfit your rack so that it can be kept "pretty" and easy to maintain. This might cost another $500.

Into this put a nice rackmount RAID system "head node" with maybe a TB of storage capacity and a BACKUP system -- a tape library. Initially you can "get by" with what amounts to an enhanced node with four disks and no backup, but you'll have to warn your users that there is no backup and that they are responsible for securing, copying, and mirroring their own valuable data elsewhere. Backup is expensive (which sucks), but for a professional operation it is obviously essential. I'd budget a MINIMUM of $2500 for the disk server alone, $4000-5000 for disk server plus backup. These numbers are starting to get really soggy -- you'd best get real quotes for exactly what you want to START with and then go find the money, not the other way around, lest you end up short!

Nodes are then added to the limit of your budget, ideally in a standard form. These days I'd recommend dual processor quad core nodes for CPU-bound Monte Carlo computations, dual-duals for codes that do a lot of vector algebra, and possibly plain old dual processor nodes if you have jobs that are REALLY memory bound, to where even dual cores start to collide (YMMV very much here, be warned). Dual-quads will get you optimum raw compute capacity per dollar, though, I think, and sound ideal for your expected initial task mix. Outfit the nodes with at LEAST 1 GB per core; 2 GB is better. Any nodes you buy in this way will have 2 gigE interfaces integrated on the motherboard, which is fine.

Try to get 3 to 4 year onsite service contracts on all "critical" electronic hardware you buy, from the switch on down. As noted above, 3 years = "infinity": at the end of this 3 year warranty you'll need to be looking for replacement hardware in any event, as the cost of powering any 8 nodes for a year will by then be really close to the cost of BUYING a single node that will do the work of the 8 with the power cost of only one.

Node prices, including warranty, will then range from a bit under $2000 to $4000, depending on memory, number of cores, and so on. Avoid bleeding edge processor clocks for YOUR starter cluster -- look for the sweet spot in CPU clock (aggregate cycles) per dollar spent, usually the second or third cheapest available CPU in any given configuration (bearing in mind the TOTAL SYSTEM price, not just the raw CPU price, in your cost-benefit estimates).
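Here is the sweet-spot arithmetic as a script, with made-up but era-plausible system prices (illustrative only -- get real quotes):

    # Aggregate cycles per dollar for hypothetical dual-quad configs.
    # (name, cores, GHz, total system price) -- illustrative numbers!
    configs = [
        ("dual-quad 1.86 GHz", 8, 1.86, 2300.0),
        ("dual-quad 2.33 GHz", 8, 2.33, 2600.0),
        ("dual-quad 3.00 GHz", 8, 3.00, 4200.0),  # bleeding-edge premium
    ]
    for name, cores, ghz, price in configs:
        agg = cores * ghz            # aggregate GHz, a crude capacity proxy
        print("%-20s %5.1f GHz aggregate, %5.2f MHz/$"
              % (name, agg, 1000.0 * agg / price))
    # The middle bin wins cycles per dollar; the top bin pays a fat
    # premium for the last 0.7 GHz -- exactly the sweet-spot effect.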
Going this route -- $1000 for rack plus accessories, $2500 for a head node, $2000 for a single worker node, $500 for error in my seat-of-the-pants estimates and miscellaneous stuff -- you can "get started" with at least 4, maybe 8 >>modern<< (64 bit, uberfast) CPU cores for around $5000, get started with backup for around $7000, or get started nicely with as many as 24 CPU cores for maybe $12,000. Which is still, believe it or not, chickenfeed in the research business.

This design scales beautifully. Go to your nuclear groups and pass the hat. Offer them free room in the rack, plus access to server and switch and backup (all paid for by the department, the university, a startup grant, whatever), if they pony up $2000-4000 for N-core nodes selected from a standard list of supported configurations, with mandatory onsite service contracts. They'll jump at the chance -- they'd have to spend twice as much to get the same capacity on their own, since THEY'D have to provide the server, AC, power, infrastructure, and management. Point out that with lots of participants they can share resources -- everybody individually will have down time when they're writing papers, out of town, or on vacation -- and they can trade access to their nodes when they're not using them in return for the same favor another day. So if they buy a node with 8 cores in the rack and so do three other groups, there might come a day when they can use all 32 processors in a pinch to finish off a paper before a deadline.

It is also easy to write proposals for. Any of your groups can write or add to a proposal a budget for N nodes that fit in the existing rack. University cost-sharing is manifest, resources are well-leveraged, funding is likely. With a full-height rack, you can add as many as 40 1U nodes to 3-5U of switches and servers, on a floor that can hold 1 ton per square meter, in a space that can provide 8-10 kW of power and 4 tons of AC (per filled rack).

THIS sort of design can scale right on out of your department. Chemistry may want to play. So might engineering. Even economics does large scale computations nowadays. You might find yourself setting up and filling a cluster room with multiple racks, wall-sized Liebert ACs, and so on. Or anyway, you can at least dream...;-)

Obviously I favor this approach if you can finagle the minimum $5K buy-in, and STRONGLY favor it if you can scare up $7K or more. I also tend to recommend that you look at e.g. www.penguincomputing.com for possible nodes, because they are linux-passionate, and their AMD Opteron nodes are excellent performers and simply (in my own experience) do not break. They'll likely cut you a break of a few percent on a collective "getting-started" price as well.

When trying to "sell" this approach, point out to the powers that be that the $5K, $10K, $15K is not the real cost of the cluster. The $1 per watt per year for power and AC (estimated) is not the real cost either. The real cost is the human time required to design it, set it up, and manage it -- and that cost is $50K and up per year! If you're doing this as a project "for free", they are already getting tens of thousands of dollars of free resource, which should certainly factor into the leverage required to pry the money loose to take proper advantage of you!

rgb

> I want to be able to make the cluster easily expandable, in that I will be
> starting with only a few machines (about 2-8), but will be acquiring more as
> time goes on. The university that I am attending surpluses out "old"
> machines every 4 years, and we have set up a program where we can get a
> percentage of the surplus machines for our cluster.
>
> So, as for size. Initially it will be a smaller cluster, but will grow as
> time goes on.
>
> Being new to the Beowulf world, I am just mainly looking for some advice as
> to what distro to use (I would never dream of setting up a cluster on
> Windows) and if there were any little tricks that weren't mentioned in the
> setup how-to guides.
> Oh, and I would also like to know if there was a way to set up a task
> priority where, if I had only one application running, it would use all the
> processors on the cluster, but if I had two tasks sent to the cluster it
> would split the load between them and run both simultaneously, each still
> using a maximum of the needed processors.
>
> Thanks again so much,
>
> Berkley
>
> On Feb 8, 2008 9:11 AM, Robert G. Brown <rgb at phy.duke.edu> wrote:
>
>> On Thu, 7 Feb 2008, Berkley Starks wrote:
>>
>>> Hello all,
>>>
>>> I've been a computer user for the past several years working in different
>>> areas of the IT world. I've recently been commissioned by my university to
>>> set up the first operating Beowulf Cluster.
>>>
>>> I am moderately familiar with the Linux OS, having run it for the past
>>> several years using the distros Debian, Ubuntu, Fedora Core, and
>>> Mandriva.
>>>
>>> With setting up this new cluster I would like any advice possible on what
>>> OS to use, how to set it up, and any other pertinent information that I
>>> might need.
>>
>> This question has been answered on-list in detail a few zillion times.
>> I'd suggest consulting (in rough order):
>>
>> a) The list archives (now that you're a member you can get to them,
>> although they are digested and googleable for the most part anyway).
>>
>> b) Google. For example, there is a lovely howto here:
>>
>> http://www.linux.org/docs/ldp/howto/Parallel-Processing-HOWTO.html
>>
>> that is remarkably current and a good quick place to start.
>>
>> c) Feel free to browse my free online book here:
>>
>> http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php
>>
>> I'm working on making it paper-printable via lulu, but I need time I
>> don't have, and so that project languishes a bit. You "can" get a paper
>> copy there if you want, but it is pretty much what is on the free
>> website, including the holes.
>>
>>> Oh, and the cluster will be used for computational physics. I am a
>>> physics major making it for the physics department here. It will need
>>> to be able to use C++ and Fortran at a bare minimum.
>>
>> C, C++ and Fortran are all no problem. The more important questions
>> are:
>>
>> a) How coupled are the parallel tasks? That is, do you want a cluster
>> that can run N independent jobs on N independent nodes (where the jobs
>> don't communicate with each other at all), or do you want a cluster
>> where the N nodes all do work on a common task as part of one massive
>> parallel program? If the former, you're in luck, and cluster design is
>> easy and the cluster purchase will be cheap.
>>
>> b) If they are coupled, are the tasks "tightly coupled", so each
>> subtask can only advance a little bit before communications are required
>> in order to take the next step? "Synchronous", so all steps have to be
>> completed on all nodes before any can advance? Are the messages really
>> big (bandwidth limited) or tiny and frequent (latency limited)?
>>
>> If any of these latter answers are "yes", post a detailed description of
>> the tasks (as best you can) to get some advice on choosing a network, as
>> that's the design parameter that is largely controlled by the answers.
>>
>> rgb
>>
>>> Thanks again
>>
>> --
>> Robert G. Brown                          Phone(cell): 1-919-280-8443
>> Duke University Physics Dept, Box 90305
>> Durham, N.C. 27708-0305
>> Web: http://www.phy.duke.edu/~rgb
>> Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
>> Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977

--
Robert G. Brown                          Phone(cell): 1-919-280-8443
Duke University Physics Dept, Box 90305
Durham, N.C. 27708-0305
Web: http://www.phy.duke.edu/~rgb
Book of Lilith Website: http://www.phy.duke.edu/~rgb/Lilith/Lilith.php
Lulu Bookstore: http://stores.lulu.com/store.php?fAcctID=877977