decreasing #define HZ in the linux kernel for CPU/memory bound
Robert G. Brown
rgb at phy.duke.edu
Tue Apr 16 08:29:03 PDT 2002
On Tue, 16 Apr 2002, Cabaniols, Sebastien wrote:
> Hi beowulfs!
> Would it be interesting to decrease the #define HZ in the linux kernel
> for CPU/Memory bound computational farms ?
> (I just posted the same question to lkml)
> I mean we very often have only one running process eating 99% of
> the CPU, but we (in fact I) don't know if we lose time doing context
> switches ....
> Did anyone experiment on that ?
> Thanks in advance
This was discussed a long time ago on kernel lists. IIRC (and it was a
LONG time ago -- years -- so don't shoot me if I don't) the consensus
was that Linus was comfortable keeping HZ at a value that provided very
good interactive response time FIRST (primary design criterion) and
efficiency for long-running tasks SECOND (secondary design criterion), so
no, they weren't considering retuning anything anytime soon. Altering HZ isn't
by any means guaranteed to improve task granularity (the scheduler
already does a damn good job there and is hard to improve). Also,
because there are a LOT of things that use it, written by many people
some of whom may well not have used it RIGHT, altering HZ may cause odd
side effects or break things. I wouldn't recommend it unless you are
willing to live without or work pretty hard to fix whatever breaks.
The context switch part of the question is a bit easier. By strange
chance, I'm at this moment running a copy of xmlsysd and wulfstat (my
current-project cluster monitoring toolset) on my home cluster, where
(to help Jayne this morning) I also cranked up pvm and the xep
mandelbrot set application. So it is easy for me to test this.
During a panel update (with all my nodes whonking away on doing
mandelbrot set iterations) the context switch rate is negligible --
12-16/second -- on true nodes (ones doing nothing but computing or
twiddling their metaphorical thumbs). The rate hardly changes relative
to the idle load when the system is doing a computation -- the scheduler
is quite efficient. Interrupt rates on true nodes similarly remain
very close to a baseline of a bit more than 100/second even when doing the
computations, which are of course quite coarse grained with only a bit
of network traffic per updated strip per node and strip times on the
order of seconds. So for a coarse grained, CPU intensive task running
on dedicated nodes I doubt you'd see so much as 1% improvement monkeying
with pretty much any "simple" kernel tuning parameter -- I think that
single numerical jobs run at well OVER 99% efficiency as is.
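For anyone who wants to check these rates on their own nodes, here is a minimal sketch. It assumes a Linux /proc/stat containing the usual "ctxt" (total context switches) and "intr" (total interrupts, first number) lines. It is not how xmlsysd gathers its numbers; the function names are made up for illustration.

```python
import time


def parse_proc_stat(text):
    """Return (context_switches, interrupts) cumulative totals from /proc/stat text."""
    ctxt = intr = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ctxt":
            ctxt = int(fields[1])
        elif fields[0] == "intr":
            # The first number after "intr" is the total over all interrupt sources.
            intr = int(fields[1])
    if ctxt is None or intr is None:
        raise ValueError("no ctxt/intr lines found")
    return ctxt, intr


def rates(before, after, seconds):
    """Convert two cumulative (ctxt, intr) samples into per-second rates."""
    return tuple((b - a) / seconds for a, b in zip(before, after))


def sample_rates(interval=5.0, path="/proc/stat"):
    """Sample the counters twice, `interval` seconds apart, and return the rates."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return rates(before, after, interval)
```

On a Linux node, sample_rates(5.0) returns (context switches/second, interrupts/second) averaged over five seconds. Note that the sampling process itself adds a few context switches, so treat very small readings as approximate.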
Note that on workstation-nodes (ones running a GUI and this and that)
the story is quite different, although still good. For example, I'm
running X, xmms (can't work without music, can we:-), the xep GUI,
wulfstat (the monitoring client), galeon, and a dozen other xterms and
small apps on my desktop; my sons are running X and screensavers on
their systems downstairs (grrr, have to talk to them about that, or just
plain disable that:-) and on THESE nodes the context switch rates range
closer to 1300-1800/sec (the latter for those MP3's). Interrupt rates
are still just over 100/sec -- this tends to vary only when doing some
sort of very intensive I/O. Note that even mp3 decoding only takes a
few percent of my desktop's CPU.
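To put those rates in perspective, here is a back-of-the-envelope sketch of how much CPU the switching and timer interrupts could cost. The per-event costs below are ASSUMED placeholders chosen only for illustration, not measurements from these machines; real costs depend on the hardware and on cache effects, so plug in your own numbers.

```python
def overhead_fraction(events_per_second, seconds_per_event):
    """Fraction of one CPU's time consumed by servicing the given events."""
    return events_per_second * seconds_per_event


# ASSUMPTIONS (hypothetical, not measured): cost of one context switch and
# of one timer interrupt, in seconds.
ASSUMED_SWITCH_COST = 5e-6
ASSUMED_TICK_COST = 2e-6

# Event rates taken from the figures quoted in this message.
cases = {
    "dedicated node, context switches": (16, ASSUMED_SWITCH_COST),
    "workstation node, context switches": (1800, ASSUMED_SWITCH_COST),
    "timer interrupts at roughly 100/sec": (100, ASSUMED_TICK_COST),
}

for label, (rate, cost) in cases.items():
    print(f"{label}: {100 * overhead_fraction(rate, cost):.4f}% of one CPU")
```

With these placeholder costs the dedicated-node figure comes out far below 1% and even the busy workstation stays just under 1%, which is consistent with the "well OVER 99% efficiency" estimate, but the conclusion is only as good as the assumed costs.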
However, beauteously enough, when I do an xep rubberband update, I
still get SIMULTANEOUSLY flawlessly decoded mp3's (not so much as a
bobble of the music stream) AND the maximum possible amount of CPU
diverted to the mandelbrot strip computations and their display.
I view this delightful responsiveness of linux as a very important
feature. I've never hesitated to distribute CPU-intensive work around
on linux workstation nodes with an adequate amount of memory because I'm
totally confident that unless the application fills memory or involves a
very latency-bounded (e.g. small packet network) I/O stream, the
workstation user will notice, basically, "nothing" -- their interactive
response will change by less than the 0.1 second threshold above which
they are likely to be ABLE to notice.
The one place I can recall where altering system timings has made a
noticeable difference in performance for certain classes of parallel
tasks is Josip Loncaric's TCP retuning, and I believe that he worked
quite hard at that for a long time to get good results. Even that has a
price -- the tunings that he makes (again, IIRC, don't shoot me if I'm
wrong Josip:-) aren't really appropriate for use on a WAN as some of the
things that slow TCP down are the ones that make it robust and reliable
across the routing perils of the open internet.
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567 Fax: 919-660-2525 email:rgb at phy.duke.edu