[Beowulf] anyone using SALT on your clusters?

Greg Lindahl lindahl at pbm.com
Mon Jul 1 13:12:32 PDT 2013


On Mon, Jul 01, 2013 at 02:01:13PM +0100, Jonathan Barber wrote:

> Pinging the host prior to connecting only determines that the IP stack is
> working, not that the OS is capable of handling an ssh connection.

Indeed, that's why we actually have a more complicated liveness
algorithm, starting with a SYN ping and following up with "echo foo",
which is a good check for a host whose kernel is fine but which is
actually hung because the system disk is out to lunch, in our case
usually due to a screwed-up RAID controller.

Even without this liveness algorithm, we wouldn't have a "hang" due to
ssh hanging because we use appropriate timeouts.
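
For illustration, a two-stage check of this kind might look roughly like
the Python sketch below. It is not the actual script: the function names,
the port, and the timeout values are assumptions, and a plain TCP connect
to the ssh port stands in for a true SYN ping (which would need raw
sockets or a tool such as nmap or hping).

```python
import socket
import subprocess


def tcp_probe(host, port=22, timeout=2.0):
    """Stage 1: cheap reachability test (full TCP connect, not a raw SYN)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def echo_probe(host, timeout=10.0):
    """Stage 2: run 'echo foo' over ssh, bounded by timeouts.

    If the remote system disk is hung, the login or shell startup
    typically stalls, so the command never prints 'foo' in time.
    """
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",      # never prompt for a password
        "-o", "ConnectTimeout=5",   # bound the connection phase
        host,
        "echo foo",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0 and result.stdout.strip() == "foo"


def is_alive(host, tcp=tcp_probe, echo=echo_probe):
    """A host counts as live only if both stages succeed, in order."""
    return tcp(host) and echo(host)
```

The probes are passed in as parameters so the ordering logic can be
exercised without touching the network; the expensive ssh stage is only
attempted once the cheap stage has succeeded.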

> WRT timeouts, the problem is determining whether a timeout means that
> the host is blocking with no possibility of responding (e.g. the NFS mount
> problem) or that the host is busy and had half-completed the command before
> it was terminated by the timeout.

And that's how we ended up with the "echo foo" test.

> For me, this results in the practical difference that the pub-sub model
> means that the agent has the ability to subscribe to the messages and is
> therefore alive - and that therefore the list of live hosts is always
> current.

Our NoSQL database uses pub-sub for cluster membership, and we found
that hosts with screwed-up RAID controllers could easily stay in the
cluster even when they were badly broken. We had to add some extra
watchdogs and tests that the system disk is working.
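
As a rough idea of what a "system disk is working" test can look like,
here is a minimal Python sketch. It is hypothetical, not the watchdog
described above: the function name, the default directory, and the
timeout are all assumptions. The write runs in a daemon thread because a
write to a hung disk can block in the kernel where it cannot be
interrupted, so the caller only waits a bounded time for it.

```python
import os
import tempfile
import threading


def disk_responds(directory="/var/tmp", timeout=5.0):
    """Return True if a small write+fsync in `directory` finishes in time."""
    done = threading.Event()

    def probe():
        try:
            fd, path = tempfile.mkstemp(dir=directory)
            try:
                os.write(fd, b"foo\n")
                os.fsync(fd)  # force the data out to the device
            finally:
                os.close(fd)
                os.unlink(path)
            done.set()
        except OSError:
            pass  # unwritable or missing directory counts as failure

    worker = threading.Thread(target=probe, daemon=True)
    worker.start()
    worker.join(timeout)  # give up waiting if the write is stuck
    return done.is_set()
```

A membership agent could call something like this before each heartbeat
and stop publishing when it returns False, so a host with a dead system
disk drops out of the live list instead of lingering in it.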

Really, before you say that some particular method doesn't work or is
bad, you should ask if anyone's successfully using that method, and
how they've worked around any problems they ran into. As you can see,
both ssh and pub-sub required some workarounds in my environment.

-- greg




