[Beowulf] Remote console management

Fri Sep 23 04:33:22 PDT 2005

Bruce Allen wrote:

> We're getting ready to put together our next large Linux compute 
> cluster. This time around, we'd like to be able to interact with the 
> machines remotely.  By this I mean that if a machine is locked up, 
> we'd like to be able to see what's on the console, power cycle it, 
> mess with BIOS settings, and so on, WITHOUT having to drive to work, 
> go into the cluster room, etc.

That is the goal, but every solution I have tried so far still required 
a monthly trip to the cluster room to reboot the problematic nodes by 
hand.

> One possible solution is to buy nodes that have IPMI cards.  These 
> piggyback on the ethernet LAN and let you interact with the machine 
> even in the absence of an OS.  With the appropriate tools running on a 
> remote machine, you can interact with the nodes even if they have no 
> OS on them or are hung.

I would say that it depends on what is hanging the machine. For 
example, there are well-known problems with IPMI cards that become 
unreachable when installed in a machine running FreeBSD.
Moreover, before buying IPMI cards, you should be aware that there are 
different hardware implementations of IPMI (have a look at Intel's 
website: they have some slides explaining the difference between a 
cheap IPMI card and a complete implementation).

> Another solution is to use the DB9 serial ports of the nodes.  You 
> have an 'administrative' box containing lots of high-port-count serial 
> cards (eg, Cyclades 32 or 64 port cards) and then run a serial cable 
> from each node to this box.  By remotely logging into this admin box 
> you can access the serial ports of the machines, and if the BIOS has 
> the right settings/support, this lets you have keyboard/console access.
>
> Or one can do both IPMI + remote serial port access.

Remote serial port access should be done outside IPMI, although I would 
say that this depends on the IPMI board you are installing.

I even think that, if you want to cut costs, you can avoid IPMI and 
rely on ssh first, then remote serial port login, and finally controlled 
power plugs to reboot a node if none of the previous solutions works.
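
As a rough sketch of that escalation order (Python, purely illustrative): 
the three step functions below are hypothetical placeholders that only 
return fixed values, not real tools. In practice each one would wrap 
whatever ssh, serial console and power-plug commands a given site uses.

```python
from typing import Callable, List, Optional, Tuple

# A recovery step is a (name, action) pair. The action takes a node name
# and returns True if the node could be reached or recovered that way.
Step = Tuple[str, Callable[[str], bool]]


def recover(node: str, steps: List[Step]) -> Optional[str]:
    """Try each step in order, least intrusive first.

    Returns the name of the first step that succeeds, or None if all fail.
    """
    for name, action in steps:
        if action(node):
            return name
    return None


# Hypothetical stand-ins: a hung node that answers neither ssh nor the
# serial console, but comes back after a power cycle.
def try_ssh(node: str) -> bool:
    return False


def try_serial_console(node: str) -> bool:
    return False


def power_cycle(node: str) -> bool:
    return True


LADDER: List[Step] = [
    ("ssh", try_ssh),
    ("serial console", try_serial_console),
    ("power cycle", power_cycle),
]

if __name__ == "__main__":
    print(recover("node042", LADDER))
```

The point of the ordering is that the power plug is only touched when 
the two gentler methods have already failed.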

>
> Could people on this list please report their experiences with these 
> or other approaches?  In particular, does someone have a simple and 
> inexpensive solution (say < $100/node) which lets them remotely:
>  - power cycle a machine
>  - examine/set BIOS values
>  - look at console output even for a dead/locked/unresponsive box
>  - ???
>
A cheap solution I used last year was USB to 8 x DB9 serial adapters 
with null-modem cables, along with kermit. This gives you a cheap 
terminal server, which you can extend with as many USB hubs as you need.

Something interesting we used (and are still using, without any problem 
since installation) is a homemade reboot solution that replaces the 
front-panel switch with a controlled switch (in the final hardware 
design we found some industrial-grade controlled transistors). Each box 
controls 16 nodes and you can chain 256 of them, which is OK for big 
clusters.
The only problem is that, being a homemade solution, you have to solder 
everything yourself. Replacing the front panels is not a big deal, 
because it just means replacing the original front-panel connection on 
the pins with the one from your solution, so no soldering should be 
required on the nodes.
The cost is about $100 for 16 nodes, if I remember correctly. After that 
you can control those transistors to reboot / halt / start every node 
from a single RS-232 port.
I can send you more details if you are interested.
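
To put numbers on the figures above, here is a small Python sketch. It 
assumes that "256 of them" means 256 chained boxes, and it uses a 
hypothetical 0-based (box, port) numbering; the actual addressing on the 
RS-232 line is not described here, so `locate` is only an illustration.

```python
from typing import Tuple

NODES_PER_BOX = 16       # each box controls 16 nodes
MAX_BOXES = 256          # assumption: up to 256 boxes on one chain
COST_PER_BOX_USD = 100   # approximate figure quoted above


def locate(node_index: int) -> Tuple[int, int]:
    """Map a 0-based node index to a hypothetical (box, port) pair."""
    if not 0 <= node_index < NODES_PER_BOX * MAX_BOXES:
        raise ValueError("node index out of range for one chain")
    return divmod(node_index, NODES_PER_BOX)


def boxes_needed(n_nodes: int) -> int:
    """Number of 16-node boxes needed for n_nodes (ceiling division)."""
    return -(-n_nodes // NODES_PER_BOX)


def hardware_cost_usd(n_nodes: int) -> int:
    """Approximate cost of the boxes alone, ignoring cables and labour."""
    return boxes_needed(n_nodes) * COST_PER_BOX_USD


if __name__ == "__main__":
    print(NODES_PER_BOX * MAX_BOXES)          # nodes reachable from one chain
    print(COST_PER_BOX_USD / NODES_PER_BOX)   # cost per node in USD
    print(locate(37), boxes_needed(100), hardware_cost_usd(100))
```

Under those assumptions one chain covers 4096 nodes at roughly $6.25 per 
node, well under the $100/node budget asked about, although this covers 
power control only and not console or BIOS access.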

Regards,

Julien Leduc