Processor contention(?) and network bandwidth on AMD

Mon Apr 29 14:07:24 PDT 2002

On Mon, 29 Apr 2002, Joshua Baker-LePain wrote:

> > > unloaded:                                         11486.6 KB/real sec
> > > 2 matlab simulations:                             10637.8 KB/real sec
> > > 2 matlab simulations and 2 SETI at homes (nice -19):  6645.4 KB/real sec
> > 
> > SETI at home is obviously in the "so don't do that" category.  I expect your
> > matlab was decelerated by a similar amount.
> 
> Sure, but it was just an example of a niced background load, which 
> "shouldn't" interfere with anything.  It certainly shouldn't crash 
> bandwidth like that.

Joshua,

Actually, running a heavy background load can (as you have observed)
significantly affect network times, especially if it is the receiver
that is loaded.  As to whether or not it "should", I cannot say (kind of
a value judgement there:-), but one can try to understand it.  There are
deliberate tradeoffs made in the tuning of the kernel and for better or
worse the linux tradeoffs optimize "user response time" at the expense
of a variety of things that might improve throughput on a purely
computational load or throughput on the network or pretty much anything
else.  Sometimes one can retune -- Josip Loncaric's TCP patch is one
such retuning, but one can also envision changing timeslice granularity
and other things to optimize one thing at the expense of others.
Generally such a retuning is a Bad Idea.  Right now the kernel is pretty
damn good, overall, and all components are delicately balanced.  As
Mark's previous reply made clear, some naive retunings would just lock
up the system (or really make performance go to hell) as important
components starve.

It isn't too hard to see why loading the receiver might decrease the
efficiency of the network.  Imagine the network component of the kernel
from the point of view of the stream receiver (not the transmitter).  It
never knows when the next packet/message will come through.  The kernel
does its best to do OTHER work in the gaps between packets by installing
top half and bottom half handlers and the like (so it does no more work
than absolutely necessary when the asynchronous interrupt is first
received, postponing what it can until later) to provide the illusion of
seamless access to the CPU and other resources for running processes.
One side effect of this is that there are times when the delivery of
packets is delayed so that a background application can complete a
timeslice it was given "in between" packets when the system was
momentarily idle.

What this ends up meaning is that when the system is BUSY, it de facto
delays the delivery of packets that it has buffered for fractions of the
many timeslices of CPU the system is allocating to the competing tasks
when the network process is momentarily idle (blocked, waiting for the
next packet). If it didn't do this a high speed packet stream could (for
example) starve running processes for CPU by forcing them to wait for
the whole stream to complete.

Processing the contents of TCP packets (not to mention the interrupts and
context switches themselves) is a nontrivial load on the CPU in its own
right, so much so that people try NOT to run high-performance network
connections for fine-grained code over TCP if they can avoid it.  The
network stack ends up contending for CPU with everything else that is
running, and it makes no sense to retune things so that this is never
true as the cure will likely be worse than the disease for most usage
patterns.  

Curiously, transmitting works more efficiently than receiving, probably
because the transmitter is in charge of the scheduling.  In very crude
terms the transmitter is never interrupted or delayed by other processes
-- it just gets its timeslice, executes a send or stream of sends,
eventually blocks (moving up in priority while blocked) or finishes its
timeslice, and then moves on.  No delays to speak of.

Try this:

  Run your netpipe transmitter on an unloaded host, a host at load 1 and
one at load 2.
  Run your netpipe receiver on an unloaded host, a host at load 1 and one
at load 2.

Fill in the matrix -- load 0 to load 0, load 0 to load 1, etc.
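
The bookkeeping for that matrix can be sketched in a few lines of Python.
This is only an illustration: the function names and the results
dictionary are made up here and are not part of netpipe, and the
throughput values still have to be measured and entered by hand.

```python
from itertools import product

LOADS = (0, 1, 2)  # background load on each host, as described in the text


def load_matrix(loads=LOADS):
    """Return every (transmitter_load, receiver_load) pair to measure."""
    return list(product(loads, repeat=2))


def format_matrix(results, loads=LOADS):
    """Rows are transmitter load, columns are receiver load.

    `results` maps (tx_load, rx_load) to a measured throughput;
    cells that have not been measured yet are shown as '-'.
    """
    lines = ["tx\\rx " + " ".join(f"{r:>10}" for r in loads)]
    for t in loads:
        cells = []
        for r in loads:
            value = results.get((t, r))
            cells.append(f"{value:>10.1f}" if value is not None else f"{'-':>10}")
        lines.append(f"{t:<6}" + " ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    results = {}  # hypothetical: fill in as results[(tx, rx)] = KB/sec
    for tx, rx in load_matrix():
        print(f"run: transmitter at load {tx}, receiver at load {rx}")
    print(format_matrix(results))
```

Reading across a row shows what loading the receiver costs at a fixed
transmitter load; reading down a column shows the reverse.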

I found (in similar tests done years ago) that a TRANSMITTER could be
loaded to 2 (per cpu) with only a small degradation of throughput, but
loading a RECEIVER would drop throughput dramatically, by as much as
50%.
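
For comparison, the relative drop in the three figures quoted at the top
of this message can be worked out directly. This is a small sketch using
only those quoted numbers; the function name is made up for
illustration, and the quoted figures do not say which end of the
connection was carrying the load.

```python
def percent_drop(baseline, loaded):
    """Throughput lost relative to the unloaded baseline, in percent."""
    return 100.0 * (baseline - loaded) / baseline


# KB/real sec, copied from the quoted measurements
unloaded = 11486.6
two_matlab = 10637.8
two_matlab_two_seti = 6645.4

if __name__ == "__main__":
    print(f"2 matlab:          {percent_drop(unloaded, two_matlab):.1f}% drop")
    print(f"2 matlab + 2 SETI: {percent_drop(unloaded, two_matlab_two_seti):.1f}% drop")
```

That works out to a drop of roughly 7% with the two matlab jobs and
roughly 42% with the SETI clients added.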

  rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu