[Beowulf] Remote console management

David Kewley kewley at gps.caltech.edu
Sun Sep 25 15:33:26 PDT 2005


On Friday 23 September 2005 09:05, Jerker Nyberg wrote:
> I am currently installing some Dell 1850 (with remote access cards) and
> HP DL140/DL380 and it would be great with some input from someone on the
> integrated remote access in a Linux environment. Remote console and
> reset/on/off is good enough for me.

I am bringing up a large cluster of PE 1850s right now.  Dell offers Linux
command line tools to change most BIOS & BMC settings from within the host
OS.

The non-IPMI command-line tools are part of Dell's free OpenManage product.
They have been excellent for tracking down, e.g., memory errors during the
cluster burn-in period.  From the master node I simply do:

  shmux -m -c "omreport system esmlog" - < /ml/all-1024 > junk

  grep Descr junk | egrep -v "(Ambient Temp|log cleared|Intrusion)" | \
    sort | uniq -c

This gives me output like this:

      1 compute-11-38.local: Description   : ECC Error Correction detected on  Bank 3 DIMM B
      1 compute-12-7.local: Description   : ECC Error Correction detected on  Bank 3 DIMM B
      1 compute-15-24.local: Description   : correctable memory error logging disabled
      6 compute-15-24.local: Description   : ECC Error Correction detected on  Bank 1 DIMM B
      2 compute-17-37.local: Description   : ECC Error Correction detected on  Bank 3 DIMM B
      2 compute-22-26.local: Description   : ECC Error Correction detected on  Bank 1 DIMM B
    375 compute-22-33.local: Description   : ECC Error Correction detected on  Bank 2 DIMM A
    333 compute-22-34.local: Description   : ECC Error Correction detected on  Bank 3 DIMM A
      4 compute-23-16.local: Description   : ECC Error Correction detected on  Bank 2 DIMM B
      3 compute-23-22.local: Description   : ECC Error Correction detected on  Bank 3 DIMM A
     20 compute-24-1.local: Description   : ECC Error Correction detected on  Bank 1 DIMM A
      1 compute-25-26.local: Description   : ECC Error Correction detected on  Bank 2 DIMM B
    103 compute-25-29.local: Description   : ECC Error Correction detected on  Bank 2 DIMM B
      1 compute-26-1.local: Description   : ECC Error Correction detected on  Bank 3 DIMM B
     18 compute-31-26.local: Description   : ECC Error Correction detected on  Bank 1 DIMM B
      2 compute-32-10.local: Description   : correctable memory error logging disabled
     12 compute-32-10.local: Description   : ECC Error Correction detected on  Bank 3 DIMM B
      1 compute-32-19.local: Description   : BMC Riser PG voltage sensor state asserted
      1 compute-32-19.local: Description   : BMC Riser PG voltage sensor state deasserted
      3 compute-32-22.local: Description   : ECC Error Correction detected on  Bank 1 DIMM B
      1 compute-35-18.local: Description   : correctable memory error logging disabled
     13 compute-35-18.local: Description   : ECC Error Correction detected on  Bank 3 DIMM B
      2 compute-37-15.local: Description   : correctable memory error logging disabled
     12 compute-37-15.local: Description   : ECC Error Correction detected on  Bank 1 DIMM A
     10 compute-42-30.local: Description   : ECC Error Correction detected on  Bank 2 DIMM A
      2 compute-42-33.local: Description   : ECC Error Correction detected on  Bank 1 DIMM B
      1 compute-43-19.local: Description   : correctable memory error logging disabled
     11 compute-43-19.local: Description   : ECC Error Correction detected on  Bank 2 DIMM A
      1 compute-43-5.local: Description   : ECC Error Correction detected on  Bank 1 DIMM B
      1 compute-46-31.local: Description   : ECC Error Correction detected on  Bank 2 DIMM B
    279 compute-47-40.local: Description   : ECC Error Correction detected on  Bank 1 DIMM B
      1 compute-47-9.local: Description   : ECC Error Correction detected on  Bank 2 DIMM A

Now I know I need to replace at least 4 specific sticks of RAM.  (This
doesn't mean "Dell RAM" is bad -- we have 6144 sticks in our compute
nodes, and I believe we're getting around 1-2% initial failure.)
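
To pick the worst offenders out of the counts above without reading the
whole list, something like the following could be used.  This is only a
sketch: the function name flag_noisy_dimms and the threshold of 100 events
are assumptions made for illustration, not part of OpenManage, and the
sample lines are copied from the output above.

```shell
# Keep only lines whose leading count (first field of `uniq -c` output)
# is at or above a threshold. The default of 100 is an illustrative
# assumption, not a Dell recommendation.
flag_noisy_dimms() {
  awk -v min="${1:-100}" '$1 + 0 >= min'
}

# Three sample lines copied from the report above.
sample='    375 compute-22-33.local: Description   : ECC Error Correction detected on  Bank 2 DIMM A
      4 compute-23-16.local: Description   : ECC Error Correction detected on  Bank 2 DIMM B
    103 compute-25-29.local: Description   : ECC Error Correction detected on  Bank 2 DIMM B'

# Prints the compute-22-33 and compute-25-29 lines only.
printf '%s\n' "$sample" | flag_noisy_dimms 100
```

In practice the input would be the output of the sort | uniq -c pipeline
rather than a hard-coded sample.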

You can report many things with omreport, configure things with omconfig,
and run many diagnostics with omdiag.  All these tools are launched from
within the target's host Linux OS.

Note: I grep out "Ambient Temp" because our room has a tendency to be colder
than Dell's default warning threshold. :)  I'll be changing that threshold
using omconfig very soon.

As for IPMI, Dell offers ipmish, with which you can do, e.g., a forced
power-off of a machine remotely (and outside the machine's OS) with
this command from your management station:

  ipmish -ip 192.168.0.100 -u root -p <password> power off -force

This works great -- I can troubleshoot node boot-ups and installs from the
comfort of home.

Dell also offers an IPMI Serial over LAN (SOL) tool, but I find it clunky.
I look forward to trying the open-source ipmitool package for SOL and other
functions.

David


