[Beowulf] Re: dual core (latency)

Tue Jul 19 03:30:08 PDT 2005

On Tue, 19 Jul 2005 06:42:02 +0200, Vincent Diepeveen wrote
> At 11:05 AM 7/19/2005 +0800, Stuart Midgley wrote:
> >The first thing to note is that as you add cpu's the cost of the  
> >cache snooping goes up dramatically.  The latency of a 4 cpu (single  
> >core) opteron system is (if my memory serves me correctly) around  
> >120ns.  Which is significantly higher than the latency of a dual  
> >processor system (I think it scales roughly as O(n^2) where n is the  
> >number of cpu's).
> >
> >Now, with a dual core system, you are effectively halving the  
> >bandwidth/cpu over the hyper transport AND increasing the cpu count,  
> >thus increasing the amount of cache snooping required.  The end  
> >result is drastically blown-out latencies.
> >
> >Stu.
> 
> This doesn't answer even remotely accurate things.

Actually it was a very well written and quite accurate discussion of what you
were seeing.

> A) my test is doing no WRITES, just READS.

Doesn't matter, unless you turn off all cache effects on the memory you are
dealing with.  A memory write is a read-modify-write operation, and a memory
read is a read operation.  Either way, you still require that initial "snoop"
to grab the cache line.  You basically ask all the other processors that have
the potential of sharing that cache line to look into which lines they have in
cache, and if they have the line in question, please flush that line if it is
dirty (i.e. a pending but uncommitted write exists).  Otherwise, please hand
over the cache line with all due speed.
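
To make the read case concrete, here is a rough toy model of a broadcast
snoop on a read miss.  It is only a sketch of the idea described above, not
the actual Opteron protocol; the Cache class and snoop_read function are
hypothetical names made up for illustration.

```python
# Toy model of a broadcast ("snoopy") read miss.
# Assumption: every peer is probed, whether or not it holds the line.
from dataclasses import dataclass, field


@dataclass
class Cache:
    name: str
    # address -> (value, dirty flag)
    lines: dict = field(default_factory=dict)


def snoop_read(requester, peers, memory, addr):
    """Service a read miss; return (value, number of probes sent)."""
    probes = 0
    for peer in peers:
        probes += 1  # the probe costs something even if the peer has nothing
        if addr in peer.lines:
            value, dirty = peer.lines[addr]
            if dirty:
                memory[addr] = value               # flush the pending write
                peer.lines[addr] = (value, False)  # line is now clean
    value = memory[addr]
    requester.lines[addr] = (value, False)
    return value, probes


memory = {0x40: 1}
a, b, c, d = (Cache(n) for n in "abcd")
b.lines[0x40] = (7, True)  # b holds a dirty copy of the line
result = snoop_read(a, [b, c, d], memory, 0x40)
```

Note that the requester only reads, yet it still sends three probes and has
to wait for b to flush before it sees the current value.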

It's not "complex" with 2 CPUs, just a little costly.  It gets complex and
time consuming with 4.  At 4 and higher it is one of the issues you take into
consideration when optimizing code.  This is also why processor affinity is so
important, as you can (to a degree) pre-bias where the pages (and hence cache
lines) are sitting relative to the CPU, and tie the memory and processor
together.  This increases the likelihood of the line being local, as well as
potentially decreasing the likelihood of the line being needed remotely.
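
As a minimal sketch of the affinity part, assuming a Linux system where
Python exposes os.sched_setaffinity, something like the following pins the
calling process to one CPU.  The pin_to_cpu helper is a hypothetical name.
Pinning by itself does not move pages that are already allocated; it relies
on the kernel's default local allocation policy placing pages touched after
the pin near that CPU.

```python
import os


def pin_to_cpu(cpu):
    """Pin the calling process to a single CPU.

    Returns the resulting set of allowed CPUs, or None when the platform
    lacks sched_setaffinity or the CPU is not currently available to us.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # not available on this platform
    if cpu not in os.sched_getaffinity(0):
        return None  # CPU is outside the set we are allowed to run on
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

Call it before allocating and touching the working set, so the memory ends
up on the same node as the CPU running the test.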

> B) snooping might be for free.

Absolutely not.

> C) all other cores are just idle when such a latency test for just 1 
> core happens and the rest of the system is idle.

The only way you can guarantee that the other cores are "idle" is to turn them
off.

> D) in all cases a 
> dual core processor has a SLOWER latency and it doesn't make sense. 

Makes a great deal of sense, as Stuart has pointed out.  Your snooping
algorithm is somewhat better than O(N**2) on a system with a directory.
Without a directory it is closer to O(N**2).  The more snooping you need to do
before getting a cache line, the more latency you pay to get that initial
cache line.  A directory-based system is effectively a hash table.
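
A back-of-the-envelope way to see the scaling, under the simplifying
assumption that each of N CPUs takes one miss: with broadcast every miss
probes the other N-1 CPUs, so the total is N*(N-1).  With a directory, a miss
looks up the line and probes only the CPUs recorded as holding it.  The names
below (broadcast_probes, Directory) are hypothetical, and this is a counting
model only, not a description of any real hardware.

```python
def broadcast_probes(n_cpus):
    """Total probes when each of n_cpus takes one miss and asks everyone."""
    return n_cpus * (n_cpus - 1)


class Directory:
    """Toy directory: a hash table from line address to the CPUs holding it."""

    def __init__(self):
        self.sharers = {}  # address -> set of CPU ids

    def read(self, cpu, addr):
        """Record a read by `cpu`; return how many other CPUs get probed."""
        holders = self.sharers.setdefault(addr, set())
        probes = len(holders - {cpu})  # only recorded holders are asked
        holders.add(cpu)
        return probes


for n in (2, 4, 8):
    print(n, broadcast_probes(n))
```

In the broadcast case the probe count grows with the square of the CPU count
no matter who holds the line; in the directory case it grows only with the
number of actual sharers.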

> E) you don't seem to grasp the difference between LATENCY and BANDWIDTH;

Hmmmm.  I think Stuart gets it very well.  I am not convinced that you get the
issue of how important and expensive cache line processing via snoopy
algorithms is, and what its impact upon overall processing time is.

Joe Landman

--
Scalable Informatics LLC
http://www.scalableinformatics.com
phone: +1 734 786 8423