How do you keep clusters running....
Jeff Layton laytonjb at bellsouth.net
Thu Apr 11 13:33:58 PDT 2002
lightdee at netscape.net wrote:

> Doug J Nordwall wrote:
>
> > On Wed, 2002-04-03 at 13:04, Cris Rhea wrote:
> >
> > > What are folks doing about keeping hardware running on large clusters?
> > >
> > > Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> > >
> > > Sure seems like every week or two, I notice dead fans (each RS-1200
> > > has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
> >
> > You running lm_sensors on your nodes? That's a handy tool for paying
> > attention to things like that. We use ours in combination with ganglia
> > and pump it to a web page and to big brother to see when a cpu might be
> > getting hot, or a fan might be too slow. We actually saved a dozen
> > machines that way...we have 32 4 processor racksaver boxes in a rack,
> > and the rack was not designed to handle racksaver's fan system. That is
> > to say, there was a solid sidewall on the rack, and it kept in heat. I
> > set up lm_sensors on all the nodes (homogeneous, so configured on one and
> > pushed it out to all), then pumped the data into ganglia
> > (ganglia.sourceforge.net) and then to a web page. I noticed that the
> > temp on a dozen of the machines was extremely high. So, I took off the
> > side panel of the rack. The temp dropped by 15 C on all the nodes, and
> > everything was within normal parameters again.
> >
> > > My last fan failure was a CPU fan that toasted the CPU and motherboard.
> >
> > Ya, we would have seen this on ours earlier...excellent tool
> >
> > [snip]
>
> We use Clusterworx, which isn't open source (from Linux Networx), but it
> goes a step further than Ganglia. It uses lm_sensors and a power control
> box (again from Linux Networx) to actually shut down a node if it is getting
> too hot, and the event parameters are all tweakable. It's always a good
> idea to have some kind of cluster monitoring software installed, but it's
> nice to be able to set up event triggers in your software in case something
> goes wrong and you're not around.

You can set a shutdown temperature via the BIOS on most decent
motherboards. You can also easily script this up if you have some
power control unit connected to a node that you can talk to (e.g.
APC's stuff). All of the stuff you need is available as open source.
You can hook all of this together with Ganglia if you want. In fact,
Matt has announced (or hinted at) the next version of Ganglia, which
will start to have a number of new features built in (but not nodal
shutdown, if I remember correctly).

Jeff Layton

> ----
> David Henry
> Synergy Software, Inc.
> lightdee at netscape.net
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
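
The "script this up" idea from the reply above could look roughly like the sketch below. It is only an illustration, not what anyone in the thread actually ran: it assumes an lm_sensors install whose `sensors -u` command prints raw `tempN_input` / `fanN_input` lines, the 70 C and 1000 RPM thresholds are made-up placeholders, and the local `shutdown -h now` call stands in for whatever command a given power control unit really needs.

```python
import re
import subprocess

# Matches raw lines such as "  temp1_input: 45.000" or "  fan2_input: 0.000".
READING = re.compile(r"^\s*((?:temp|fan)\d+)_input:\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_readings(raw):
    """Return {(chip, sensor): value} parsed from `sensors -u` style text."""
    readings = {}
    chip = None
    for line in raw.splitlines():
        if not line.strip():
            chip = None           # a blank line ends a chip block
            continue
        if chip is None:
            chip = line.strip()   # first line of a block names the chip
            continue
        match = READING.match(line)
        if match:
            readings[(chip, match.group(1))] = float(match.group(2))
    return readings


def find_problems(readings, max_temp_c=70.0, min_fan_rpm=1000.0):
    """Return (severity, message) pairs; both thresholds are placeholder assumptions."""
    problems = []
    for (chip, sensor), value in sorted(readings.items()):
        if sensor.startswith("temp") and value > max_temp_c:
            problems.append(("critical", f"{chip} {sensor}: {value:.1f} C is too hot"))
        elif sensor.startswith("fan") and value < min_fan_rpm:
            problems.append(("warning", f"{chip} {sensor}: {value:.0f} RPM is too slow"))
    return problems


def check_node():
    """Read the sensors once, report problems, halt the node on a critical one."""
    raw = subprocess.run(["sensors", "-u"], capture_output=True,
                         text=True, check=True).stdout
    problems = find_problems(parse_readings(raw))
    for severity, message in problems:
        print(f"{severity}: {message}")
    if any(severity == "critical" for severity, _ in problems):
        # Assumption: halt locally. A networked power control unit would need
        # its own vendor-specific command here instead.
        subprocess.run(["shutdown", "-h", "now"], check=False)
```

Running `check_node()` from cron every minute or so on each node would give a crude version of the event triggers described above. A slow fan is only reported as a warning here, so that a single dead case fan does not take a node down by itself.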
