[Beowulf] followup on 1000-node Caltech cluster
David Kewley kewley at gps.caltech.edu
Sat Jun 18 16:04:56 PDT 2005
Hi all,

I wrote to this list on 6/6 about the large cluster that we expect to install at Caltech. I got a bunch of great replies (on- and off-list), and wrote a brief followup on 6/8. Here's another, more definite & detailed followup.

It is now confirmed that we will receive this cluster around 7/20. The shipment will be 26 fully-loaded racks plus various other stuff. I'm working out the receiving details; I believe at this point that with eight workers we can get all 30 tons or so of pallets off the truck(s) and into the subbasement room inside one workday. I have been working closely with the Caltech Transportation department to plan this, and I will work with Dell & the shipper as much as required until we have a very good plan.

Our goal is to have the cluster running well enough to show off our near-realtime earthquake application in late August or September. There is of course an enormous amount of work that needs to happen both before and after that point.

I was informed yesterday that there will be only one technical staff member supporting the cluster. We will work very closely with our vendors to get things working right, and we will hire additional help in the first couple of months if needed. But after the initial period, I'll be the only support person. I'm excited & honored to work with this cluster, but I'm not at all certain that my best efforts can deliver the results they want. We'll see; I've made my opinions known quite clearly. I've also advised them to consider what happens if I get hit by a bus or otherwise become unavailable.

Many of your replies focused on the need for as much automation as possible. We will in fact have a lot of automation capability available; the challenge will be integrating it all.

The room has a 500kVA/400kW Liebert UPS for the computing equipment, a 130kVA UPS for the HVAC, and six ?-ton Liebert chilled-water HVAC units.
The cooling is adequate; the numbers suggest that we'll be close to maxing out the 400kW computer UPS. All of this equipment will be networked, and our automation systems (to be built) will have access to lots of data and control parameters on this equipment.

Our compute node racks will dissipate up to ~14kW each. The power strips are three 5.7kW 208V 3-phase APC networked/switched "rack PDUs". The intakes for the HVAC units are ~3 feet from the backs of the racks. We expect to have almost laminar air flow, but with good local & roomwide mixing due to an up&down airflow pattern from the supply ducts.

We also have a BTU meter hooked up to the chilled water supply/return lines. Combined with the UPS data, we will thus be able to have our automated systems continuously calculate the energy balance -- is the cooling system taking out the energy going in? The sign of the instantaneous energy balance is a very good predictor of the temperature trend in the room. Thus we will be able to initiate alarms and/or shutdown procedures long before the temperature rise is noticeable. This is all to be automated *before* the systems are permanently powered up -- we will not have 24/7 human coverage, so the automated response systems are critical.

It was calculated that if the air cooling suddenly stopped while the power into the room was 400kW, but the fans kept the air circulating, the average rate of temperature increase would be 4 degrees F per second. I remain a bit skeptical of that number personally -- for example, there was no allowance for the heat capacity of the now-stationary chilled water in the heat-transfer coils, nor of the walls, floors, ceiling, air ducts, and other surfaces in the room. But the Planetary Science folks here know their atmospheric modeling (it's what they do for a living), so I'm pretty confident that their results are correct given their simple assumptions.
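A minimal sketch of the energy-balance check and the air-only heating estimate described above, in Python. Everything named here is an assumption made for illustration: the function names, the 5 kW deadband, the idea that the BTU meter reports a rate in BTU/hr, and the room air volume (which the post does not give). It also ignores heat from HVAC fan motors, lighting, and people, and, like the calculation mentioned above, the heat capacity of everything in the room other than air.

```python
# Illustrative sketch only; names and default values are assumptions,
# not part of any vendor interface or of the actual site automation.

BTU_PER_HR_PER_KW = 3412.14  # 1 kW is approximately 3412.14 BTU/hr


def net_heat_kw(ups_load_kw, chiller_btu_per_hr):
    """Power entering the room minus heat removed by chilled water, in kW.

    Positive means more energy is going in than is being taken out.
    """
    return ups_load_kw - chiller_btu_per_hr / BTU_PER_HR_PER_KW


def trend(net_kw, deadband_kw=5.0):
    """Classify the room temperature trend from the sign of the balance.

    The deadband (an assumed value) avoids flapping on meter noise.
    """
    if net_kw > deadband_kw:
        return "warming"
    if net_kw < -deadband_kw:
        return "cooling"
    return "steady"


def air_heating_rate_f_per_s(net_kw, room_air_m3,
                             air_density_kg_m3=1.2,
                             air_cp_j_per_kg_k=1005.0):
    """Rate of temperature rise in deg F per second if only the air stores heat.

    room_air_m3 is a hypothetical input; the real room volume is not stated.
    """
    heat_capacity_j_per_k = room_air_m3 * air_density_kg_m3 * air_cp_j_per_kg_k
    kelvin_per_s = net_kw * 1000.0 / heat_capacity_j_per_k
    return kelvin_per_s * 9.0 / 5.0


if __name__ == "__main__":
    # Example: full 400 kW load with chilled water flow lost entirely.
    net = net_heat_kw(400.0, 0.0)
    print(trend(net), air_heating_rate_f_per_s(net, room_air_m3=1000.0))
```

Because the air-only rate scales inversely with the assumed air volume, and the walls, coils, and equipment add heat capacity that this model leaves out, the result is best read as an upper bound on how fast the room air can warm.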
We plan to configure our automated systems to initiate a compute-node shutdown immediately upon loss of power or loss of chilled water flow, to be completed in ~30 seconds, with enforcement at the 30-second mark via our separate controls on the individual power lines. Shutdown of the more critical systems may be delayed, but will also occur automatically in a continuing loss situation.

We'll also deal with events from the smoke alarm / precharge fire suppression system and the subfloor water sensors in automated, appropriate ways. We'll have a programmatic trigger for the Emergency Power Off circuit for the room, as well as a number of shielded-pushbutton and emergency-break-glass buttons in the room itself.

I'm hoping to recruit five volunteers to stand at the six AC units with me and test what happens if all the fans stop at once with the computers running full-tilt. Hopefully we won't cook too quickly to turn the units back on. The person who did the calculation refuses to help with this test, and thinks we shouldn't do it at all. :)

We've already had a flood in the subbasement that left at least 2" of water on the sub-raised-floor slab (cause of the flood: plumber errors in a different area of the subbasement). The machines stayed up until I shut them down manually (this happened on a Friday night; I got a 7:13 AM Saturday phone call), but the power outlets were on the slab sitting in (clean) water for a few hours, so the electrical contacts were completely corroded. The outlets are now raised 13", and the conduits to the outlets are waterproof.

We also have calcium deposits on the unsealed cement slab, which sits on soil. The cause is water impregnation of the slab (from the flood and from failure of the under-slab sealing layer) bringing lime to the surface where the water evaporates, leaving the lime. I will be investigating how best to prevent this in the future, and will be arranging for cleaning.
The sub-raised-floor area is a high-speed air conduit for the facility, and we've already had a couple of lime snowstorms when previously-idle HVAC units were turned on.

Regarding automation capabilities of the computing equipment itself: We will have console access via the Dell IPMI BMC (baseboard management controller), over the Ethernet network. I have a hint at least that BIOS management and updates can be automated to some degree on these machines, but I haven't verified the details. The BMC will give us power-on/off/reset control, plus we have networked switched power strips as another mechanism to control power to individual nodes.

Rocks is installed on our existing 160-node cluster, but I am far from sufficiently familiar with it. I will be learning a lot about it in the near future, and I've gotten expressions of interest & support from Rocks authors & users. One thing Rocks will do (combined with the Dell PXE support) is automatic re-imaging of the compute nodes. So that's taken care of.

We have NBD onsite service on the compute nodes and 4hr onsite service on the critical equipment. I fully intend and expect to get Dell to help me tap this support efficiently. I am confident that will work OK -- if it doesn't, they'll hear about it. We do have a few spare compute nodes; I believe we'll also have a spare 48-port GigE Nortel switch (these stackable switches form our GigE network).

The node hard disks are 10k & 15kRPM SCSI. Out of the 161 machines I have now, a handful of disks have gone south (in ~6 months), and a few more nodes have failed for other reasons.

We will have support for LSF and for Ibrix; I expect good support whenever I need it. Of course, I'll have to become intimately familiar with these technologies myself.

I have Myrinet running fine on our 160-node cluster.
I had no significant hardware issues, and the only software issues were getting the required versions of the software for our hardware (not included with Rocks 3.2), and doing several days of work to convert e.g. gm into a proper rpm. GM's build tools don't conform well to rpm's design assumptions, but I got a very good rpm in the end. Upgrades will now be easy. I will get very good support from Myricom -- they have been very helpful already, and are very interested in seeing this cluster work well.

I have a good deal of Linux experience, and I have a good number of immediate colleagues whose experience I tap regularly (and they mine). We have good Linux resources on campus. Even so, I will be treading on territory that no one on campus has seen before. This machine room will be rivaled at Caltech only by the CACR facility, which is a collection of a large number of systems in one large room.

Our data storage will be a top-notch DataDirect SAN with 30-40 TB total available after redundancy, built on FibreChannel disks. I expect to have a single filesystem served by sixteen Ibrix segment nodes with high-availability failover of the servers (and RAID and other redundancy on the SAN side). I am unsure whether I'll send all the storage data over the Nortel-based GigE network (our initial design), or whether I'll kick some compute nodes off the Myrinet and put the Ibrix segment servers on that. It will depend on the performance and other issues we see, and I'll have some experts to draw on when deciding these issues.

We will not support user storage of non-scratch data on the node-local hard drives. A select subset of the multi-TB main data store will be backed up using a 4.8TB-native LTO2 tape library.

The user pool is already fairly experienced on a couple of mid-sized clusters, and I expect we'll keep that experience pool growing as students etc. come and go. They will be expected to help each other with the majority of helpdesk and howto issues.
Regarding mission creep, the problem is that I have a science background and enjoy programming, so I'd *like* to work with users on their code issues. :) I've been told, though, that application support is not my responsibility. I will have to exert a lot of self-discipline not to get scattered.

There ya go, a more complete description of our situation. I'll be contacting our vendors (including those on this list) for help in planning and execution in the next month and beyond. Beyond that, I may not have sufficient time to respond to each person who writes to me. All the same, I very much appreciate any feedback you send my way.

Thanks!

David
