[Beowulf] Remote console management
Douglas Eadline deadline at clustermonkey.net, Sat Sep 24 10:21:29 PDT 2005
- Previous message: [Beowulf] Remote console management
- Next message: [Beowulf] Remote console management
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
> We're getting ready to put together our next large Linux compute cluster.
> This time around, we'd like to be able to interact with the machines
> remotely. By this I mean that if a machine is locked up, we'd like to be
> able to see what's on the console, power cycle it, mess with BIOS
> settings, and so on, WITHOUT having to drive to work, go into the cluster
> room, etc.
>

This brings up an interesting point. I realize it does come down to a design philosophy, but cluster economics sometimes create non-standard solutions. So here is another way to look at "out-of-band monitoring".

Instead of adding layers of monitoring and control, why not take that money and buy extra nodes (but make sure you have a remote hard power cycle capability)? If a node dies and cannot be rebooted, turn it off and fix it later. Of course, monitoring fans and temperatures is a good thing (tm), but if a node will not boot and you have to play with the BIOS, then I would consider it broken.

Because you have "over capacity" in your cluster (you bought extra nodes), a dead node does not reduce the amount of work that gets done. Indeed, prior to the failure you can have the extra nodes working for you. You fully understand that at various times one or two nodes will be offline. They are taken out of the scheduler and there is no need to fix them right away.

This approach also depends on what you are doing with your cluster, the cost of nodes, etc. In some cases out-of-band access is a good thing. In other cases, the "STONITH-AFIT" ("shoot the other node in the head and fix it tomorrow") approach is also reasonable.

--
Doug

check out http://www.clustermonkey.net
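The "STONITH-AFIT" policy above can be sketched roughly as follows. This is a minimal illustration in Python, not anything from the post itself: the `Cluster` class and the four hooks (`power_cycle`, `is_alive`, `power_off`, `drain`) are hypothetical names. At a real site they might wrap a remote power controller and the batch system's "take node offline" command, but no specific tool is assumed here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Cluster:
    """Toy model of an over-provisioned cluster (hypothetical, for illustration)."""
    nodes: List[str]
    required: int  # nodes the workload actually needs
    offline: Set[str] = field(default_factory=set)  # switched off, awaiting repair

    def available(self) -> int:
        """Number of nodes still in service."""
        return len(self.nodes) - len(self.offline)

    def has_capacity(self) -> bool:
        """True while the spare nodes still cover the failures."""
        return self.available() >= self.required


def handle_dead_node(
    cluster: Cluster,
    node: str,
    power_cycle: Callable[[str], None],
    is_alive: Callable[[str], bool],
    power_off: Callable[[str], None],
    drain: Callable[[str], None],
) -> str:
    """Shoot it, and if that fails, fix it tomorrow.

    Try one remote hard power cycle. If the node still does not come back,
    switch it off, remove it from the scheduler and leave it for later repair.
    All four hooks are hypothetical stand-ins for site-specific tools.
    """
    power_cycle(node)
    if is_alive(node):
        return "recovered"
    power_off(node)
    drain(node)
    cluster.offline.add(node)
    return "offline"


if __name__ == "__main__":
    # 18 nodes bought for a workload that needs 16: two spares.
    cluster = Cluster(nodes=[f"n{i:02d}" for i in range(18)], required=16)
    status = handle_dead_node(
        cluster,
        "n07",
        power_cycle=lambda n: print(f"hard power cycle {n}"),
        is_alive=lambda n: False,  # pretend the node never comes back
        power_off=lambda n: print(f"power off {n}"),
        drain=lambda n: print(f"drain {n} from scheduler"),
    )
    print(status, cluster.available(), cluster.has_capacity())
```

The hooks are passed in as plain callables so the policy stays independent of whichever power controller or scheduler a site uses. In the demo, the failed node ends up offline, 17 of 18 nodes remain in service, and the 16-node requirement is still met, which is the point of buying the extra nodes.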
