[Beowulf] Configuration change monitoring

Thu Aug 30 05:18:07 PDT 2007

On Wed, 29 Aug 2007, Mark Hahn wrote:

>> There is a big push where I work to use commercial monitoring, and service
>
> I'm terribly sorry.  this is a sign that accountants have eaten
> the brains of your IT heads.
>
>> What do HPC sites use for configuration/change management and
>> configuration monitoring other than cfengine and puppet (agent based)?
>> Are there any agentless monitoring tools available to check for
>> inconsistencies between a node and other nodes, or deviation from a
>> policy?
>
> HPC clusters are normally a horde of clones.  no configuration change
> is applied individually to a node, but rather applied en masse.
> reimaging nodes is not a huge big deal, for instance (and a non-event
> if you use nfs-root - definitely a good idea in some cases.)

To amplify Mark's remark a bit -- Linux in general already has many
fairly powerful tools for DOING monitoring, updates, and so on.  For
example, one can use yum, kickstart, apt, and more to install a
"canned" node configuration and keep it up to date.  The process is so
thoroughly automatic that there is basically no need to "check for
inconsistencies" on a node.  Warewulf and several other tools also permit
one to have rigorous control over node configuration.

Nodes tend to be very slowly varying in configuration.  If they are
firewalled (so update streams aren't an issue) and used for a very few
computations built on top of the standard libraries, they might well be
installed and NEVER change until they are turned off four or five years
later.  Or they might be updated nightly by yum and upgraded once in
mid-life, with neither step changing their basic package configuration.
So configuration management is, as Mark says, not something that is
worth the time to automate beyond the substantial automation already
present in the rpm or Debian update
mechanisms.  Even if (say) one's nodes were installed without the GSL
libraries and somebody wanted an application that used them, it would
usually be a matter of issuing a single command through e.g. a parallel
shell, or of adding the package to the list picked up by the automatic
nightly yum update, waiting a day, and poof, it would be there.
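
For anyone who still wants an agentless consistency check of the kind
asked about, here is a minimal sketch in Python.  It assumes the package
lists have already been gathered somehow (for instance from `rpm -qa`
run on each node over ssh) and treats the package set shared by the
most nodes as the reference.  The function name, node names, and package
strings are all hypothetical, made up for illustration.

```python
from collections import Counter


def find_package_drift(node_packages):
    """Compare each node's package set against the majority ("clone") set.

    node_packages: dict mapping node name -> iterable of package strings,
    e.g. the lines of `rpm -qa` output collected from that node.
    Returns a dict holding only the deviating nodes:
        node -> {"missing": [...], "extra": [...]}
    """
    frozen = {node: frozenset(pkgs) for node, pkgs in node_packages.items()}
    if not frozen:
        return {}
    # The reference configuration is whichever package set most nodes share.
    reference, _ = Counter(frozen.values()).most_common(1)[0]
    drift = {}
    for node, pkgs in frozen.items():
        if pkgs != reference:
            drift[node] = {
                "missing": sorted(reference - pkgs),
                "extra": sorted(pkgs - reference),
            }
    return drift


if __name__ == "__main__":
    # Hypothetical data standing in for per-node `rpm -qa` output.
    cluster = {
        "node01": ["glibc-2.5", "gsl-1.8", "openssh-4.3"],
        "node02": ["glibc-2.5", "gsl-1.8", "openssh-4.3"],
        "node03": ["glibc-2.5", "openssh-4.3", "nethack-3.4"],
    }
    for node, diff in sorted(find_package_drift(cluster).items()):
        print(node, "missing:", diff["missing"], "extra:", diff["extra"])
```

On a horde of clones the result should normally be empty; anything that
does show up is a candidate for a reinstall rather than a hand repair.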

"Policy" control is a somewhat different issue, as it means different
things to different people.  On many clusters, resource allocation and
policy is set by the cluster owner telling his or her minions who gets
to run on what.  On others, it is managed by batch management tools,
e.g. SGE or OpenPBS or whatever, which allows some degree of software
control over who runs what where and when.  For really fine grained
"policy control" there is Condor, which actually started its existence
as a policy/resource management tool more than anything else.  The
biggest hassle associated with any of these tools is that they tend to
be complex.  Policy/resource control is intrinsically complex,
especially if one has rules like "I want my sub-cluster instantly
available from 8 am to 8 pm, but other people can access it all night
for their long running jobs, except for holidays and lunch hour when I
also don't care who uses it".  So people don't tend to use them unless
they must, but if you need them, you need them.  Fortunately they are
there to be used.
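
To give a feel for why even one such rule gets fiddly, here is the
example rule above written out as a small Python predicate.  This is
only a sketch: the lunch hour is assumed to be 12:00 to 13:00, the
holiday list is a placeholder, and none of it reflects how SGE, OpenPBS,
or Condor actually express policy.

```python
from datetime import date, datetime, time

# Hypothetical holiday list; a real site would maintain its own.
HOLIDAYS = {date(2007, 9, 3)}


def owner_has_priority(when, holidays=HOLIDAYS):
    """Return True if the sub-cluster is reserved for its owner at `when`.

    Encodes the example rule: reserved from 8 am to 8 pm, except during
    the lunch hour (assumed here to be 12:00-13:00) and on holidays.
    """
    if when.date() in holidays:
        return False
    t = when.time()
    if time(12, 0) <= t < time(13, 0):
        return False
    return time(8, 0) <= t < time(20, 0)


if __name__ == "__main__":
    for hour in (3, 9, 12, 15, 21):
        moment = datetime(2007, 8, 30, hour, 0)
        print(moment, "owner only" if owner_has_priority(moment) else "open")
```

Every further clause (per-user exceptions, preemption of jobs already
running, time zones) adds another branch, which is where the complexity
of the real tools comes from.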

Monitoring tools abound, of course, ranging from things like syslog-ng
for centralized monitoring/logging of LAN systems activity to nightly
system summary scripts that report on how many IP addresses tried to
crack your systems and what your summary disk usage and so on are (admin
controllable, of course).  For people more interested in monitoring
cluster state from the system and network load point of view, there are
ganglia, wulfware, and bproc (and probably other tools as well).
With them one can see at a glance what the load is on all the nodes in
the cluster, what the network is doing, whether or not the nodes are
swapping or paging, whether or not a node rebooted unexpectedly in the
middle of the night, and what jobs all the nodes are actually running.

Is there something else one needs to do?  Well, most cluster admins tend
to be fairly skilled Linux administrators and good at shell script magic
or even real programming.  So if one has an edge-case need that isn't
directly met by one of the available tools, it is usually a fairly
simple matter to hack out a script to accomplish it.  Since scripting
languages like Perl got threads, one can fairly easily write a Perl
script that initiates and controls a complete distributed application:
for example, one that runs through a list of nodes by name or IP
address, starts a thread for each, runs an ssh command on each of them,
collects the returns, and does whatever with them.  Obviously such
a control program can be as complex as you like, and do whatever you
like, and run as often as you like.
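
As a rough illustration of that pattern, here is the same idea sketched
in Python rather than Perl, using a pool of worker threads from the
standard library.  It assumes passwordless (key-based) ssh to every
node; the function names and the `runner` hook are inventions for this
sketch, not part of any existing tool.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_run(node, command, timeout=30):
    """Run `command` on `node` via ssh; return (returncode, stdout, stderr)."""
    try:
        proc = subprocess.run(
            # BatchMode=yes makes ssh fail instead of prompting for a password.
            ["ssh", "-o", "BatchMode=yes", node, command],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return (None, "", "timed out")
    return (proc.returncode, proc.stdout, proc.stderr)


def run_on_nodes(nodes, command, runner=ssh_run, max_workers=16):
    """Run `command` on every node concurrently using a thread pool.

    Returns a dict mapping node -> whatever `runner` returned for it.
    `runner` can be swapped out, e.g. for testing without real nodes.
    """
    nodes = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda n: runner(n, command), nodes)
        return dict(zip(nodes, results))


if __name__ == "__main__":
    # Hypothetical node names; substitute your own.
    for node, (rc, out, err) in run_on_nodes(["node01", "node02"], "uptime").items():
        print(node, rc, out.strip() or err.strip())
```

Threads are a reasonable fit here because each worker spends nearly all
of its time waiting on ssh, not computing.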

So if you really want to know exactly what user frobnitz is doing at a
granularity of one minute (suspecting that he's sneaking in and running
his jobs when he isn't supposed to) it isn't too difficult to distribute
a sleep/loop of ps aux | grep frobnitz and collect the results on a
control node where you can look them over later.  Although most of us
would solve such a problem by going down to frobnitz's office with a
sucker rod in hand and tapping it gently on top of his or her head while
pointing out that individuals who sneak in jobs when they aren't
supposed to will be firmly schooled...;-)
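
For the scripted route, the filtering step might look something like the
Python sketch below.  It assumes the text of `ps aux` has already been
collected from a node and that it has the usual eleven columns ending in
COMMAND; the function name and sample data are hypothetical.

```python
def processes_for_user(ps_output, user):
    """Return (pid, command) pairs for `user` from `ps aux` text.

    Matching on the USER column avoids the false hits that a plain
    `grep frobnitz` can produce (including the grep process itself).
    """
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 10)  # 11 columns; COMMAND may hold spaces
        if len(fields) == 11 and fields[0] == user:
            rows.append((int(fields[1]), fields[10]))
    return rows


if __name__ == "__main__":
    sample = (
        "USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND\n"
        "root         1  0.0  0.0   2072   624 ?        Ss   Aug29   0:01 init [3]\n"
        "frobnitz  4242 99.0  1.2 123456 65432 ?        R    02:13 180:22 ./sneaky_job\n"
    )
    for pid, cmd in processes_for_user(sample, "frobnitz"):
        print(pid, cmd)
```

Run once a minute against each node's output and appended to a log on
the control node, this gives the one-minute granularity mentioned above.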

      rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu