Motherboard / Benchmark Questions...

Thu Jun 15 07:10:26 PDT 2000

On Thu, 15 Jun 2000, Dean Waldow wrote:

> > the memory access pattern of your code in detail).  Is your Monte Carlo
> > algorithm doing a random site update (and hence jumping all over
> > memory)?  Is there any way to organize it to operate more locally?
> 
> I think you are right in that it seems memory dependent with 128MB being
> enough and likely memory speed influenced at the least. The algorithm
> does pick a random spot in my 3D lattice and consequently I would say it
> does likely jump all over memory.  As to organizing the code to operate
> more locally, I don't know a simple way. I would have to really study
> the implications for the results to feel confident about that relative
> to the simulation time savings.

Well, Monte Carlo is my bag in physics, and I've done quite a few
comparative studies of e.g. quench times (and autocorrelation times in
general) comparing the averages obtained and relaxation times and sample
independence times and so forth.  I'd be happy to look over your
problem/solution, if you like, to see whether I think it would make any
difference to use e.g. a typewriter or checkerboard algorithm instead
of random site selection.  The general rule is that if you are
evaluating macroscopic thermodynamic averages it doesn't -- if you are
simulating a time-dependent process and are measuring e.g. relaxation
rates it does, because autocorrelation relaxation is very much dependent
on the thermalization model.  If you're doing something other than
importance sampling Monte Carlo, though, I'd have to look at it and
think about it (or you could just copy your source, alter the core loop
that uses random site selection to use typewriter instead, and do a test
run on a small lattice to compare the averages you obtain).
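
A minimal sketch of that small-lattice test, assuming a 2D
nearest-neighbor Ising model with Metropolis updates as a stand-in (the
simulation under discussion is a 3D lattice code whose details are not
given here).  The function names, lattice size, temperature and sweep
counts are all illustrative assumptions.  The only difference between
the two runs is how the site is picked:

```python
import math
import random


def sweep(spins, n, beta, rng, order):
    """One Monte Carlo sweep: n*n attempted Metropolis spin flips."""
    for k in range(n * n):
        # Site selection is the ONLY difference between the two schemes.
        site = rng.randrange(n * n) if order == "random" else k
        i, j = divmod(site, n)
        # Sum of the four nearest neighbours, periodic boundaries.
        nbrs = (spins[(i + 1) % n][j] + spins[(i - 1) % n][j]
                + spins[i][(j + 1) % n] + spins[i][(j - 1) % n])
        d_e = 2.0 * spins[i][j] * nbrs  # energy change if this spin flips
        if d_e <= 0.0 or rng.random() < math.exp(-beta * d_e):
            spins[i][j] = -spins[i][j]


def energy_per_site(spins, n):
    """Ising energy per site (J = 1), counting each bond once."""
    e = 0
    for i in range(n):
        for j in range(n):
            e -= spins[i][j] * (spins[(i + 1) % n][j] + spins[i][(j + 1) % n])
    return e / (n * n)


def mean_energy(order, n=8, temperature=3.0, burn_in=200, sweeps=2000, seed=1):
    """Average energy per site after discarding burn_in sweeps."""
    rng = random.Random(seed)
    spins = [[1] * n for _ in range(n)]
    beta = 1.0 / temperature
    total = 0.0
    for s in range(burn_in + sweeps):
        sweep(spins, n, beta, rng, order)
        if s >= burn_in:
            total += energy_per_site(spins, n)
    return total / sweeps


if __name__ == "__main__":
    print("random site selection:", round(mean_energy("random"), 3))
    print("typewriter:           ", round(mean_energy("typewriter"), 3))
```

If the two averages agree within statistical error, the ordered sweep is
a reasonable substitute for equilibrium averages.  This kind of test
says nothing about relaxation rates, which is exactly where the choice
of update order matters.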

As a general rule, by the way, random site selection is the SLOWEST
method to converge, slower even than a shuffled (random without
replacement) selection strategy.  This is because the Poissonian process
leaves a lot of sites (a fraction of about 1/e, or 37%) unvisited in any
given Monte Carlo sweep.  In fact, for a large lattice, there are often
sites that aren't visited for MANY sweeps.  These sites significantly
delay the thermalization process.
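
The unvisited-site claim is easy to check numerically.  The sketch below
is only an illustration: it assumes a "sweep" means N random picks (with
replacement) on a lattice of N sites, and the function names are made
up.  The chance that a given site is missed in one sweep is (1 - 1/N)^N,
which tends to 1/e for large N.

```python
import math
import random


def unvisited_fraction(n_sites, rng):
    """Fraction of sites never picked during one sweep of n_sites random picks."""
    visited = [False] * n_sites
    for _ in range(n_sites):
        visited[rng.randrange(n_sites)] = True
    return visited.count(False) / n_sites


def sweeps_until_all_visited(n_sites, rng):
    """Number of sweeps (picks / n_sites) before every site has been picked once."""
    visited = [False] * n_sites
    remaining = n_sites
    picks = 0
    while remaining:
        site = rng.randrange(n_sites)
        picks += 1
        if not visited[site]:
            visited[site] = True
            remaining -= 1
    return picks / n_sites


if __name__ == "__main__":
    rng = random.Random(0)
    print("unvisited after one sweep:", unvisited_fraction(100000, rng))
    print("analytic limit 1/e:       ", math.exp(-1.0))
    print("sweeps to touch all sites:", sweeps_until_all_visited(10000, rng))
```

The second function shows the "MANY sweeps" effect: the last unvisited
site typically waits on the order of ln(N) sweeps, so the wait grows
with lattice size.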

Anyway, I should probably not bore the list folks with statistical
physics... the remarks above were relevant enough for the list simply
because they emphasize the point that in many cases the "speed" of a
program depends strongly on the algorithm, and the algorithm of choice
need not be the "physical" one as long as it can be shown that one gets
the same (correct) answers.

> > > the long run.
> > 
> > The only safe way to compare is to test it.  My own tests of Athlons
> > with my Monte Carlo code were very disappointing -- I get by far the
> > best price/performance on Celerons, as my code is generally local enough
> > to run satisfactorily with a 128 K L2 cache (even allowing for slower
> > memory).  The benchmarks I've run suggest that the Athlon's real
> > strength is its cache and memory subsystem.  However, your mileage may
> > vary considerably.
> 
> I hope to have an Athlon test in the near future and it will be
> interesting to see where it falls.

I have access to one and can run your code for you if you send me a
tarball and instructions.  Or I can likely arrange for you to have an
"account for a day" to play with it if you send me an encrypted passwd
line to stick into our passwd file on the host.  I'm curious myself and
we got the (900 MHz) Athlon mostly to test anyway.

> The (non)linearity with clock speed is much more understandable now.  I
> also have some benchmarks on a 733MHz PIII but have not been confident
> in them yet since I don't know much about the system they were run on,
> and they seemed to be almost the same as the 550MHz/600MHz PIII
> tests I did.  If I get confident in that number, it would be consistent
> with the memory intensive nature of the code.  The interesting question
> then seems to be connected to memory bus speed.  Hence, processor speed
> can keep going up but if the memory bus is the limiting factor then you
> might not see much difference.  When I then get a benchmark on a system
> with a faster bus, it might be pretty informative also.

This really sounds like it is the case.  Random site selection brings
out the worst in your caching subsystem -- very few of the memory
references, especially for a large program, will be in cache so you are
slowed down to 40-150 nanosecond rates per reference, which typically
will leave your CPU twiddling its proverbial thumbs while waiting for
data to arrive.  There is a very nice mental image of this process in
Pfister's "In Search of Clusters" -- he compares CPUs to clerks working
away at a desk on whatever is there.  Every time they need a number or
instruction that isn't there, they have to call out to an old geezer
sitting propped in a chair who shuffles off to the main filing cabinet
(main memory) and eventually drops it on their desk.  I'd guess your
program is constantly waiting for the old codger, and so is relatively
insensitive to CPU clock but very sensitive to memory subsystem.
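
To put rough numbers on the difference between the two access patterns,
here is a toy cache simulation.  Everything in it is an assumption made
for illustration: a direct-mapped 128 K cache with 32-byte lines, a
64^3 lattice of 4-byte sites, and only the selected site itself being
touched (no neighbours, no other program data).  Real caches are more
elaborate, so treat the hit rates as indicative only.

```python
import random


def hit_rate(addresses, cache_bytes=128 * 1024, line_bytes=32):
    """Hit rate of a direct-mapped cache for a sequence of byte addresses."""
    n_slots = cache_bytes // line_bytes
    tags = [None] * n_slots      # which memory line each cache slot holds
    hits = 0
    total = 0
    for addr in addresses:
        line = addr // line_bytes
        slot = line % n_slots
        if tags[slot] == line:
            hits += 1
        else:
            tags[slot] = line    # miss: fetch the line from main memory
        total += 1
    return hits / total


def typewriter_addresses(n_sites, site_bytes=4):
    """Visit every site once, in memory order."""
    return (site_bytes * k for k in range(n_sites))


def random_addresses(n_sites, rng, site_bytes=4):
    """Visit n_sites randomly chosen sites (with replacement)."""
    return (site_bytes * rng.randrange(n_sites) for _ in range(n_sites))


if __name__ == "__main__":
    n_sites = 64 ** 3            # hypothetical lattice, 1 MB of site data
    print("typewriter hit rate:", hit_rate(typewriter_addresses(n_sites)))
    print("random hit rate:    ",
          hit_rate(random_addresses(n_sites, random.Random(0))))
```

In this model the ordered sweep misses only on the first site of each
cache line, while random selection hits only when the wanted line
happens to be resident, which for a lattice much larger than the cache
is rare.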

This is where the PIII most certainly beats the Celeron, although
neither of them comes close to the Alpha family.  You might want to
borrow time on an Alpha to run your benchmarks there as well.  Its
memory subsystem is MUCH faster than that of the Intel family.  The
Athlon might do much better for you as well (highly nonlinear in the
nominal clock), although in ALL cases if you can reorganize your code to
be 80-90% cache-local (like "most" code is) you'll regain CPU clock
sensitivity and reduce the clock-equivalent gap between Intel, Athlon
and Celeron.
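
A back-of-the-envelope model of the clock sensitivity argument, under
loudly stated assumptions: each memory reference is accompanied by a
fixed amount of CPU work (10 cycles here, an invented figure), every
cache miss costs a fixed 100 ns (inside the 40-150 ns range quoted
above), and hits are free.  The 550 and 733 MHz clocks are the ones
mentioned in the quoted benchmarks; nothing here is measured data.

```python
def time_per_reference_ns(clock_mhz, hit_rate, miss_ns=100.0, work_cycles=10.0):
    """Average time per memory reference: CPU work plus expected miss penalty."""
    cycle_ns = 1000.0 / clock_mhz
    return work_cycles * cycle_ns + (1.0 - hit_rate) * miss_ns


def speedup(slow_mhz, fast_mhz, hit_rate, **model):
    """How much faster the higher-clocked CPU runs under this model."""
    return (time_per_reference_ns(slow_mhz, hit_rate, **model)
            / time_per_reference_ns(fast_mhz, hit_rate, **model))


if __name__ == "__main__":
    print("clock ratio 733/550:", round(733.0 / 550.0, 3))
    for hit_rate in (0.1, 0.9, 0.99):
        print("hit rate", hit_rate, "-> speedup",
              round(speedup(550.0, 733.0, hit_rate), 3))
```

With a low hit rate the miss penalty dominates and the faster clock buys
almost nothing; as the hit rate rises the speedup moves toward the full
clock ratio.  The exact figures depend entirely on the assumed work per
reference and miss cost.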

   rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu