When the vector fits entirely in L1, sequential access, backward
access, and random access take about the same amount of time.  Note
that the code being timed in the test is:

 for (i=0; i<size; i+=stride){
   aindex = ai[i];
   aitmp = ai[aindex];
   ai[aindex] = aitmp;
 }

less the time required for (separately evaluated):

 for (i=0; i<size; i+=stride){
   aindex = ai[i];
 }

where the only difference between sequential, reverse sequential, and
random access is how the ai[] vector is filled with its own indices
(sketched below).  Since the loop index i always sweeps forward while
aindex may point at the far end of the vector (backwards) or anywhere
in it (random), there will be a small penalty when the vector's two
ends won't both fit in cache at the same time.
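
For concreteness, here is a minimal sketch (not the actual cpu_rate
source, just an illustration of the idea) of how ai[] could be filled
for the three patterns; the timed loop itself is identical in all three
cases:

 #include <stdlib.h>

 void fill_sequential(int *ai, int size)
 {
   int i;
   for (i = 0; i < size; i++) ai[i] = i;            /* ai[i] = i       */
 }

 void fill_backwards(int *ai, int size)
 {
   int i;
   for (i = 0; i < size; i++) ai[i] = size - 1 - i; /* top to bottom   */
 }

 void fill_random(int *ai, int size)
 {
   int i, j, tmp;
   for (i = 0; i < size; i++) ai[i] = i;
   /* Fisher-Yates shuffle: ai[] becomes a random permutation of 0..size-1 */
   for (i = size - 1; i > 0; i--) {
     j = rand() % (i + 1);
     tmp = ai[i]; ai[i] = ai[j]; ai[j] = tmp;
   }
 }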

As we increase the size of the vector to roughly half of L2 (and much
bigger than L1):

rgb at ganesh|T:910>cpu_rate -t 7 -s 32000
#
========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 128
# Time: 77094.8508 +/-    91.4151 nsec
#
========================================================================
# Timing test 7
# Samples = 100  Loop iterations per sample = 32
# Time: 231774.8517 +/-   166.2378 nsec 
#========================================================================
# Sequential Integer Memory (read/write) Access:
# size = 32000  stride = 1  vector length = 128000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = i initially.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 3.621482 avg_time_empty = 1.204607 
# Average Time:   2.42 nanoseconds
# BogomegaRate: 413.76 megaseqmem int read/writes per second

rgb at ganesh|T:909>cpu_rate -t 8 -s 32000
#
========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 128
# Time: 77434.2657 +/-   517.2918 nsec
#
========================================================================
# Timing test 8
# Samples = 100  Loop iterations per sample = 32
# Time: 220100.5173 +/-   134.2393 nsec 
#========================================================================
# Sequential (backwards) Integer Memory (read/write) Access:
# size = 32000  stride = 1  vector length = 128000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = i initially.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 3.439071 avg_time_empty = 1.209910 
# Average Time:   2.23 nanoseconds
# BogomegaRate: 448.60 megabackseqmem int read/writes per second

rgb at ganesh|T:918>cpu_rate -t 9 -s 32000
#
========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 128
# Time: 76797.9778 +/-    39.3887 nsec
#
========================================================================
# Timing test 9
# Samples = 100  Loop iterations per sample = 32
# Time: 400156.5388 +/-   168.9157 nsec 
#========================================================================
# Random Integer Memory (read/write) Access:
# size = 32000  stride = 1  vector length = 128000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = shuffled index.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 6.252446 avg_time_empty = 1.199968 
# Average Time:   5.05 nanoseconds
# BogomegaRate: 197.92 megarandmem int read/writes per second

These numbers are pretty reproducible, and suggest that as long as the
vector fits entirely in L2 it makes little difference whether it is
accessed forwards or backwards; backwards actually seems to have a tiny
advantage, at least for this code.  Random access times, on the other
hand, are stretching out.  This also suggests that cache is "working"
pretty well -- L1 is hiding L2 memory access times for either
sequential direction of access, but random access patterns defeat about
half of its ability to do so (assuming L1 times of about 1 ns per read
or write and L2 times of about 9-10 ns).
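
As a quick sanity check on that "about half" estimate, here is a crude
two-level model (the 1 ns L1 and 9 ns L2 latencies are the assumed
figures above, not measurements) that backs out the implied L1 hit
fraction from the measured 5.05 ns random access time:

 #include <stdio.h>

 int main(void)
 {
   double t_l1  = 1.0;   /* assumed L1 access time (ns)          */
   double t_l2  = 9.0;   /* assumed L2 access time (ns)          */
   double t_obs = 5.05;  /* measured random read/write time (ns) */
   double f;

   /* solve f*t_l1 + (1 - f)*t_l2 = t_obs for the L1 hit fraction f */
   f = (t_l2 - t_obs) / (t_l2 - t_l1);
   printf("implied L1 hit fraction: %.2f\n", f);   /* about 0.49 */
   return 0;
 }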

When we stretch the vector out to much larger than L2:

rgb at ganesh|T:919>cpu_rate -t 7 -s 1000000
#
========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 2
# Time: 11713794.0611 +/-  3828.4599 nsec
#
========================================================================
# Timing test 7
# Samples = 100  Loop iterations per sample = 2
# Time: 21220962.0875 +/- 32389.0768 nsec 
#========================================================================
# Sequential Integer Memory (read/write) Access:
# size = 1000000  stride = 1  vector length = 4000000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = i initially.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 10.610481 avg_time_empty = 5.856897 
# Average Time:   4.75 nanoseconds
# BogomegaRate: 210.37 megaseqmem int read/writes per second

rgb at ganesh|T:920>cpu_rate -t 8 -s 1000000
#
========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 2
# Time: 11752219.3785 +/- 31999.0185 nsec
#
========================================================================
# Timing test 8
# Samples = 100  Loop iterations per sample = 2
# Time: 30729268.6483 +/- 32411.0175 nsec 
#========================================================================
# Sequential (backwards) Integer Memory (read/write) Access:
# size = 1000000  stride = 1  vector length = 4000000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = i initially.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 15.364634 avg_time_empty = 5.876110 
# Average Time:   9.49 nanoseconds
# BogomegaRate: 105.39 megabackseqmem int read/writes per second

rgb at ganesh|T:921>cpu_rate -t 9 -s 1000000
#
========================================================================
# Timing "Empty" Loop
# Samples = 100  Loop iterations per sample = 2
# Time: 11714174.2590 +/-  4242.4690 nsec
#
========================================================================
# Timing test 9
# Samples = 100  Loop iterations per sample = 2
# Time: 238072616.9525 +/- 92417.8420 nsec 
#========================================================================
# Random Integer Memory (read/write) Access:
# size = 1000000  stride = 1  vector length = 4000000:
#   aitmp = ai[aindex]
#   ai[aindex] = aitmp
#   where aindex = ai[i] = shuffled index.
# NANOTIMER granularity (nsec/cycle) =  0.750
# avg_time_full = 119.036308 avg_time_empty = 5.857087 
# Average Time: 113.18 nanoseconds
# BogomegaRate:   8.84 megarandmem int read/writes per second
80.960user 0.013sys 99.8%, 0ib 0ob 0tx 0da 0to 0swp 1:21.10

These are very interesting numbers indeed.  Sequential latencies, at
4-5 nanoseconds, are still only about half of what I think the L2
latency itself is, suggesting that memory prefetch is doing remarkably
well at BOTH moving the data from memory into L2 AND moving the data
from L2 into L1 before it is actually needed.  Pretty remarkable,
really -- the system is only about 2.5x slower than when it ran
straight out of L1.

Reverse sequential latencies (where some of the cache's efficiency is
admittedly degraded due to the particular order of access of ai[]
elements in the loops above) seem to be very nearly what I expect L2
latency to be -- around 9-10 nanoseconds.  Perhaps a better way of
splitting things out would push it as low as 8 nanoseconds, but this is
anywhere from 60-100% slower than sequential access.  It is still quite
fast.

There are a number of possible interpretations of these numbers.  One
might be that the memory prefetch is >>still<< working pretty well from
memory into L2, but that the data isn't making it through into L1 as
effectively as it does when the entire vector fits in L2.

Alternatively, the number could just be an average over increases in
the reversed prefetch time from memory into L2 that happens to work out
close to the L2 access time.  In that case, it could be DDR SDRAM
itself that is partly at fault: I seem to recall that it has a fairly
hefty latency penalty for the first byte, and significantly reduced
penalties for subsequent SEQUENTIAL bytes, and it may be that reverse
order access prevents it from realizing this advantage.  It would be
very interesting to rewrite the streaming tests so that they proceed
one byte at a time (up or down), or by two, four, or eight bytes, with
various fairly deliberate starting offsets (see the sketch below).
Could the reverse slowdown be associated with e.g. word boundaries?
This would require some additional tests (and maybe a new parameter for
the offset) to be added to cpu_rate, but it might be worthwhile to give
it a shot.
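
Something along these lines, perhaps -- a purely hypothetical sketch
(the function name and parameters are invented; nothing like this is in
cpu_rate yet):

 #include <stddef.h>

 /* Stream through a byte buffer with a chosen step (1, 2, 4, 8 bytes),
    direction, and starting offset, so that reverse-order and alignment
    effects could be timed separately. */
 unsigned char byte_stream(unsigned char *buf, size_t len,
                           long step, long offset, int down)
 {
   unsigned char sum = 0;
   long i;

   if (down) {
     for (i = (long) len - 1 - offset; i >= 0; i -= step)
       sum += buf[i];               /* descending access */
   } else {
     for (i = offset; i < (long) len; i += step)
       sum += buf[i];               /* ascending access  */
   }
   return sum;   /* returned so the compiler can't optimize the loop away */
 }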

I don't know if this helps at all, but it was an interesting afternoon
playing and forced me to clean up cpu_rate (again).  Maybe even enough
that it won't segfault the first time somebody tries it:-).