[Beowulf] wulfstat, wulflogger fix, new features
Robert G. Brown
rgb at phy.duke.edu
Tue May 18 07:51:00 PDT 2004
Karl Bellve reported a bug in wulflogger that caused it to miss connecting
to the first host in the wulfhosts list until the second pass. He also
requested a feature that would let wulflogger execute only a single time
and then exit, so that it could be used in e.g. a cron script to graze
for downed hosts in a cluster easily.
I found the bug, a legacy from wulfstat: I closed stdin pre-curses,
which caused the first descriptor SUCCESSFULLY returned from socket() to
be 0 (the reused stdin slot), which actually doesn't work. This is a bug
in socket() I personally would say, but either way, once I eliminated
the close statement wulflogger connects to the first host with no
problem on the first try.
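
The descriptor reuse described above can be reproduced in a few lines. This
is only a minimal sketch, not code from wulfstat or wulflogger, and the
function name is made up for illustration. It relies on the POSIX rule that
a new descriptor takes the lowest unused number, so with stdin closed the
next socket() normally comes back as 0:

```c
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from wulflogger): close stdin, then open a TCP
 * socket and return its descriptor. Because descriptor 0 has just been
 * freed, socket() is expected to hand back 0, a perfectly successful
 * return that is easy to mistake for "nothing" or to trip over in code
 * that still treats 0 as stdin.
 */
int socket_after_closing_stdin(void)
{
    close(STDIN_FILENO);                    /* frees descriptor 0 */
    return socket(AF_INET, SOCK_STREAM, 0); /* lowest free slot, normally 0 */
}
```

Any caller has to test the result with "< 0" rather than "== 0" or
"!fd", since 0 is a valid descriptor here.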
I implemented the request by adding a -c count flag to both wulflogger
and wulfstat. -c 1 is the behavior requested, but somebody may have use
for the greater flexibility permitted by it being a variable. I also
updated both Usage and the man page for both applications, in the case
of wulflogger including an example fragment that might go into a cron
job to graze for down hosts, and in both cases adding a short section
on debugging (I've written the code to be tremendously self-debugging to
make it relatively easy to maintain or augment).
Still to implement:
a) I want to add a ping to the connection engine to precede the
xmlsysd connection attempt. ping actually is a bit of a pain -- the
usual iputils implementation requires suid root. nmap, however, has
three or four distinct ways of "pinging" that don't require root
privileges, and eventually I'll try stealing one although the code is a
lot more complex than I'd like for a simple task. Anybody with a
SHORT/SIMPLE version of userspace (e.g. ack) ping in C should feel free
to let me know where to find it.
b) I need to do something about tracking running jobs in wulflogger,
and figure out a better display for them in wulfstat.
c) I still have fantasies of writing gwulfstat on top of gtk. This
could be a very cool application.
d) And wulfweb needs love as well, although that is straightforward
web programming at this point -- wulflogger is the real tool involved.
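
For item a), one way to "ping" without root is a plain TCP connect() with a
timeout: a completed connection or an explicit refusal both show the host is
alive. What follows is a rough sketch under that assumption, not code from
nmap or from xmlsysd; the name tcp_probe and its return convention are
invented here, and only the first address returned by getaddrinfo() is
tried:

```c
#include <errno.h>
#include <fcntl.h>
#include <netdb.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical userspace "ping": returns 1 if the host answered within
 * timeout_ms (connection accepted OR refused), 0 if nothing came back,
 * and -1 if the name/port lookup or socket creation failed.
 */
int tcp_probe(const char *host, const char *port, int timeout_ms)
{
    struct addrinfo hints, *res;
    int fd, flags, alive = 0;

    memset(&hints, 0, sizeof hints);
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0) {
        freeaddrinfo(res);
        return -1;
    }

    /* Non-blocking, so connect() cannot hang on a dead host. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags >= 0)
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);

    if (connect(fd, res->ai_addr, res->ai_addrlen) == 0) {
        alive = 1;                          /* connected immediately */
    } else if (errno == ECONNREFUSED) {
        alive = 1;                          /* host is up, port is closed */
    } else if (errno == EINPROGRESS) {
        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, timeout_ms) == 1) {
            int err = 0;
            socklen_t len = sizeof err;
            /* SO_ERROR holds the final result of the pending connect. */
            if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == 0)
                alive = (err == 0 || err == ECONNREFUSED);
        }
    }

    close(fd);
    freeaddrinfo(res);
    return alive;
}
```

A caller would try something like tcp_probe(hostname, port, 500) before the
real connection attempt and skip hosts that return 0. A firewall that
silently drops packets will make a live host look dead, so this is a cheap
pre-check and not a replacement for ICMP ping.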
Anyway, those of you who are using it, enjoy. Those who aren't,
consider giving xmlsysd/wulf[stat,logger,web] a try. It is a fairly
simple way to monitor an entire cluster (tested with order <100 hosts,
don't know how or if it scales to ~1000) in a lightweight fashion with
adjustable time granularity.
Those of you who are also LAN managers might consider using it to
monitor your LAN status as well. The default wulfstat/wulflogger
display is something like:
# Name Status Timestamp load1 load5 load15 rx byts tx byts si so pi po ctxt intr users
lilith up 1084891476.44 0.01 0.04 0.01 9761 7171 0 0 0 22 148 170
asixteencharname up 1084891476.44 0.01 0.04 0.01 9761 7171 0 0 0 22 148 170
lucifer up 1084891610.24 0.00 0.02 0.00 226 709 0 0 0 9 135 104
uriel up 1084887238.42 0.00 0.00 0.00 1030 1672 0 0 0 5 36 114
eve up 1084888284.75 0.00 0.00 0.00 685 1168 0 0 0 11 21 109
serpent up 1084877687.98 0.00 0.00 0.00 1116 1707 0 0 0 6 41 187
tyrial up 1084891762.44 0.00 0.00 0.00 3146 3064 0 0 0 9 208 218
archangel up 1084888715.71 0.00 0.00 0.00 119 1376 0 0 0 30 28 105
(used to look at my home cluster, with one machine turned off and one
machine down awaiting a reinstall.) There is a display that only looks
at load, a display only for network traffic, one for network usage, even
one that tells you uptime and duty cycle (cpu cycles used/cpu cycles
available) from the last boot. All GPL v2b...
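
For the cron use mentioned earlier, a script only needs the first two
columns of each line in the display above. A small sketch of that check
follows; the function name is hypothetical, and it assumes that any status
other than "up" means the host needs attention, since the sample output only
shows hosts that are up:

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical filter for one line of the default display.
 * Returns 1 and copies the host name into name[] if the line is a host
 * row whose Status column is anything other than "up"; returns 0 for
 * header/comment lines, blank lines, and hosts that are up.
 */
int host_is_down(const char *line, char *name, size_t namelen)
{
    char host[64], status[32];

    if (line[0] == '#')                     /* column header line */
        return 0;
    if (sscanf(line, "%63s %31s", host, status) != 2)
        return 0;                           /* blank or malformed line */
    if (strcmp(status, "up") == 0)
        return 0;

    snprintf(name, namelen, "%s", host);
    return 1;
}
```

Fed line by line with the output of a single pass (-c 1), this gives the
list of hosts to report.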
I suggest rebuilding the source rpm or working from tarball, although
people running RH 9 can probably install the binary rpms without
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu