How do you keep clusters running....
Doug J Nordwall
nordwall at pnl.gov
Wed Apr 3 14:46:31 PST 2002
On Wed, 2002-04-03 at 13:04, Cris Rhea wrote:
> What are folks doing about keeping hardware running on large clusters?
> Right now, I'm running 10 Racksaver RS-1200's (for a total of 20 nodes)...
> Sure seems like every week or two, I notice dead fans (each RS-1200
> has 6 case fans in addition to the 2 CPU fans and 2 power supply fans).
Are you running lm_sensors on your nodes? That's a handy tool for paying
attention to things like that. We use it in combination with Ganglia,
pumping the data to a web page and to Big Brother so we can see when a
CPU might be getting hot or a fan might be running too slow. We actually
saved a dozen machines that way. We have 32 four-processor RackSaver
boxes in a rack, and the rack was not designed for RackSaver's fan
system; that is to say, it had a solid sidewall that kept heat in. I set
up lm_sensors on all the nodes (they're homogeneous, so I configured one
and pushed it out to the rest), then pumped the data into Ganglia
(ganglia.sourceforge.net) and on to a web page. I noticed that the
temperature on a dozen of the machines was extremely high, so I took off
the side panel of the rack. The temperature dropped by 15 C on all the
nodes, and everything was back within normal parameters.
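To give a rough idea of the kind of check this enables, here's a small sketch (not our actual setup) that scans lm_sensors-style output for hot readings. The 60 C threshold and the sample text are made up for illustration; real `sensors` output varies by chip and driver, so the regex would need adjusting:

```python
import re

# Flag any temperature reading above a limit in lm_sensors-style output.
# Threshold and sample format are assumptions, not lm_sensors canon.
TEMP_LIMIT_C = 60.0

def hot_sensors(sensors_output, limit=TEMP_LIMIT_C):
    """Return (label, temp) pairs whose reading exceeds `limit` C."""
    hot = []
    for line in sensors_output.splitlines():
        # Match lines like "CPU Temp: +71.5 C ..." (fan RPM lines won't match)
        m = re.match(r"\s*(\w[\w ]*):\s*\+?(-?\d+(?:\.\d+)?)\s*C", line)
        if m:
            label, temp = m.group(1), float(m.group(2))
            if temp > limit:
                hot.append((label, temp))
    return hot

sample = """CPU Temp: +71.5 C  (limit = +80 C)
Sys Temp: +42.0 C  (limit = +70 C)
fan1:     4500 RPM"""

print(hot_sensors(sample))  # only CPU Temp exceeds 60 C here
```

A cron job running something like this per node, with the results fed to a web page or Big Brother, is all the early warning you need.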
> My last fan failure was a CPU fan that toasted the CPU and motherboard.
Yeah, we would have seen that coming earlier on ours. It's an excellent tool.
> How are folks with significantly more nodes than mine dealing with constant
> maintenance on their nodes? Do you have whole spare nodes sitting around-
> ready to be installed if something fails, or do you have a pile of
No, we don't, actually, but we've talked about it.
> Did you get the vendor (if you purchased prebuilt systems)
> to supply a stockpile of warranty parts?
We use RackSaver as well, so our experience is similar. We should
probably talk to our people about getting some spare nodes.
> One of the problems I'm facing is that every time something croaks,
> Racksaver is very good about replacing it under warranty, but getting
> the new parts delivered usually takes several days.
Yeah, this is another area where just monitoring the data can help. If a
fan is failing, you can see it coming (the temperature slowly rises), so
you can order the part beforehand and schedule the downtime.
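As a sketch of what "seeing it coming" might look like in code (the RPM readings, the 3000 RPM floor, and the 24-sample horizon are all hypothetical numbers, not our actual monitoring), here's a simple least-squares trend check on a fan's speed history:

```python
# Trend-based early warning: if a fan's RPM readings drift steadily
# downward, flag it before it dies outright. All numbers are examples.

def rpm_slope(readings):
    """Least-squares slope (RPM change per sample) over the readings."""
    n = len(readings)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(readings) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, readings))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def fan_needs_replacing(readings, floor=3000, horizon=24):
    """True if the current downward trend crosses `floor` within `horizon` samples."""
    slope = rpm_slope(readings)
    projected = readings[-1] + slope * horizon
    return slope < 0 and projected < floor

healthy = [4500, 4510, 4490, 4505, 4495, 4500]   # jitter around nominal
failing = [4500, 4400, 4290, 4210, 4100, 4000]   # steady decline

print(fan_needs_replacing(healthy))  # False
print(fan_needs_replacing(failing))  # True
```

Run against periodic samples from lm_sensors, a check like this gives you days of lead time to order the replacement fan and schedule the swap.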
Cristopher J. Rhea            Mayo Foundation
Research Computing Facility   Pavilion 2-25
crhea at Mayo.EDU             Rochester, MN 55905
(507) 284-0587                Fax: (507) 266-4486
Beowulf mailing list, Beowulf at beowulf.org
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Douglas J Nordwall http://rex.nmhu.edu/~musashi
System Administrator Pacific Northwest National Labs