[Beowulf] any creative ways to crash Linux?: does a shared NIC IPMI always remain responsive?

Sat Oct 24 15:13:19 PDT 2009

Now that I have remote IPMI and SOL working, my next step is to try to
crash Linux to see whether there are "pathological crash cases" where
I would end up having to go to the server room. So far, whatever I do,
I'm pleasantly surprised that "chassis power cycle" always seems to
work!

I tried:

 `echo "c" > /proc/sysrq-trigger` to produce a kernel panic. The node
can still be rebooted through its IPMI interface.
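
For anyone who wants to repeat the test, here is a rough sketch of the
sequence. It assumes ipmitool is installed on a second machine and that
the BMC accepts the lanplus interface; the user name `admin` and the
password file `~/.ipmipass` are placeholders, not my real settings. With
DRY_RUN=1 (the default here) it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: BMC address, user and password file are placeholders.
BMC_HOST="${BMC_HOST:-10.0.0.26}"
BMC_USER="${BMC_USER:-admin}"
BMC_PASSFILE="${BMC_PASSFILE:-$HOME/.ipmipass}"
DRY_RUN="${DRY_RUN:-1}"

# Run (or, in dry-run mode, just print) an ipmitool command against the BMC.
bmc() {
    if [ "$DRY_RUN" = "1" ]; then
        echo ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -f "$BMC_PASSFILE" "$@"
    else
        ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -f "$BMC_PASSFILE" "$@"
    fi
}

# On the node under test, as root, the crash itself is:
#   echo 1 > /proc/sys/kernel/sysrq     # make sure SysRq is enabled
#   echo c > /proc/sysrq-trigger        # force a kernel crash

# From the second machine: check that the BMC answers, then recover the node.
bmc chassis power status
bmc chassis power cycle
```

Set DRY_RUN=0 once the printed commands look right for your setup.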

What surprised me was that even if I take down my eth interface with
an ifdown, the IPMI still works. How does it do that? I am using
the shared NIC approach and I was expecting the IPMI to clam up the
moment the OS took the port down.

On Sept 30 Joe Landman said:

>After years of configuring and helping run/manage both, we recommend strongly *against* the shared physical connector approach.  The extra cost/hassle of the extra cheap switch and wires is well worth the money.
>Why do we take this view?  Many reasons, but some of the bigger ones are

(I know Joe Landman and others had warned me against this but I tried
to start with configuring a single shared NIC and then go for two
NICs. Just keeping things simple to start with.)

But my single shared NIC results seem good enough already. Which is
why I was trying to see if there are any worse possibilities of
crashes that will render contacting the IPMI impossible.

On Sept 30 Joe Landman said:

>a) when the OS takes the port down, your IPMI no longer responds to arp requests.  Which means ping, and any other service (IPMI) will fail without a continuous updating of the arp tables, or a forced hardwire of those ips to those mac addresses.

Another point that surprises me is how the IPMI kept working even
after CentOS took the port down. I definitely see Joe Landman's
argument about why it shouldn't be responding to ARP requests any more
(unless I did something special). That's why I am a bit surprised that
my IPMI IP continues to respond to pings even after the primary
IP is dead.

#Ping primary IP address
ping 10.0.0.25
[no response]

#Ping IPMI IP address
ping 10.0.0.26
PING 10.0.0.26 (10.0.0.26) 56(84) bytes of data.
64 bytes from 10.0.0.26: icmp_seq=1 ttl=64 time=0.574 ms
64 bytes from 10.0.0.26: icmp_seq=2 ttl=64 time=0.485 ms

Interestingly, arp shows the primary IP as incomplete, but the IPMI
IP still resolves to a MAC address. This means that the BMC continues
to answer on its own MAC even after the OS took the eth port down.
How exactly does this "magic" happen? I'm just curious.

node25                           (incomplete)                              bond0
10.0.0.26                ether   00:24:E8:63:D6:9E   C                     bond0

Another mysterious observation was this: whenever I take the eth
interface down via the OS, there is a latent period during which the
IPMI stops responding, but then it somehow resurrects itself and
starts working again.
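
To put a number on that latent period, something like the sketch below
could be run from a second machine just before the ifdown. It prints a
timestamped line each time the target changes between up and down. The
function names `probe` and `watch_host` are made up for illustration,
and the ping flags assume Linux iputils (`-W` is a timeout in seconds).

```shell
#!/bin/sh
# Sketch only: logs each up/down transition of a host, one probe per interval.

# One probe: a single ping with a 1-second timeout (Linux iputils flags).
probe() {
    ping -c 1 -W 1 "$1" >/dev/null 2>&1
}

# watch_host HOST COUNT: probe HOST COUNT times, INTERVAL seconds apart
# (default 1), and print a line whenever the state changes.
watch_host() {
    host=$1
    count=$2
    last=""
    i=0
    while [ "$i" -lt "$count" ]; do
        if probe "$host"; then state=up; else state=down; fi
        if [ "$state" != "$last" ]; then
            echo "$(date '+%H:%M:%S') $host $state"
            last=$state
        fi
        i=$((i + 1))
        sleep "${INTERVAL:-1}"
    done
}
```

For example, `watch_host 10.0.0.26 300` watches the IPMI address for
about five minutes; the gap between the "down" and the next "up" line is
the latent period, to within the probe interval and ping timeout.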

Just making sure this isn't a fluke case. Any comments or more
disaster scenario simulations are welcome!

-- 
Rahul