Motherboard / Benchmark Questions...

Thu Jun 15 12:48:21 PDT 2000

On Thu, 15 Jun 2000, Chris wrote:

> On Thu, 15 Jun 2000, Robert G. Brown wrote:
> [snip]
> > independence times and so forth.  I'd be happy to look over your
> > problem/solution if you like to see if I think that it would make any
> > difference to use an e.g. typewriter or checkerboard algorithm instead
> > of random site selection.  The general rule is that if you are
> [snip]
> > 
> > As a general rule, by the way, random site selection is the SLOWEST
> > method to converge, slower even than a shuffled (random without
> > replacement) selection strategy.  This is because the Poissonian process
> > leaves a lot of sites unvisited in any given Monte Carlo sweep.  In
> > fact, for a large lattice, there are often sites that aren't visited for
> > MANY sweeps.  These sites significantly delay the thermalization
> > process.
> [snip]
> 
> Dr. Brown,
> 
> Could you discuss the typewrite, checkerboard and
>  random site selection algorithms for MC a bit more? 
>  
> I think to the extent they would effect the time to completion
> because of cache locality that more information on the algorithm
> is on topic for the list.  I am certainly interested :>

Hmmm, the best thing to do is to start with a practical reference, like
Binder and Heerman's book on Computational Monte Carlo (or some such, my
copy is out on loan so I don't have the exact title handy -- it's a
Springer-Verlag).

The following is a very short (really:-) survey on importance sampling
Monte Carlo methology.  Many folks on the list will probably want to
just hit "d" about now, if they haven't already.

What?  Still there?  OK, you asked for this, so here it is.

If one is using Monte Carlo to do statistical physics, one generally is
trying to do an average over a distribution weighted with the Boltzmann
factor, \exp{-E_i/kT} where E_i is the energy of the ith configuration
(or state), k is the Boltzmann constant, and T the absolute temperature
in appropriate units (E/kt must obviously be dimensionless).  The
normalization factor for the weights, and the quantities being averaged,
must technically be summed over >>all states<< (or configurations).

There are, as it happens, a rather lot of them (states/configurations).
Even when it is finite (and often it is infinite), as finite numbers go,
this is one that really really wants to be infinity.  Consequently,
doing the actual sum over all states is out.

The next best thing to do is to try to take advantage of the fact that
at any given temperature, the normalized weight factor is nearly zero
(>>very<< nearly zero) for very nearly all of the configurations/states
that can occur.  If only we could find the >>important<< ones and
average over them, we'd be able to (maybe) get a decent thermodynamic
average in less than infinite time.

By great good fortune (or some very elegant mathematics, you choose:-)
it just so happens that there exists something called detailed balance
between all of these configurations.  That basically means that if one
defines a transition operator that carries one from one state or
configuration to another, at equilibrium the flow (of probability
weight) into state a from all other states must be counterbalanced by
the flow out of state a to all other states.  By writing a differential
equation for this greatly-to-be-desired condition and doing a bit of
algebra, one can show that this occurs when the transitions themselves
are weighted according to the Boltzmann distribution, but just on the
>>local<< change in the energy of the configuration(s).

Voila!  Thus the possibility of a Markov Process is born -- if one can
create a Markov Process that carries one from state a (any
configuration) to any other state b with a Boltzmann weight, one can
iterate the process to go from state a->b->c->... AND one can show that
after "a while" the states one ends up in nearly all of the time are
precisely the ones that contribute the bulk of the statistical weight of
the full sum over all states.  In fact, the Markov Chain thus
constructed wanders about through all non-forbidden states and
ergodically populates them according to their density in the original
weighted sum.

However, there are many, many possible Markov Processes that satisfy the
rather weak condition on the ratio of transition probabilities required
to lead to equilibrium and a valid importance sampling sequence.
Without going into detail as to what they are or how they work, some of
the best-known of them are:

Metropolis, which is an accept-reject method that takes configuration a,
randomly generates a new configuration b that is typically "close" to a
in some sense, and compares their energies.  If E_b is less than E_a,
the new configuration is "accepted", and becomes the new configuration
a.  If it is greater than E_a, a (pseudo)random number between 0-1 is
compared to \exp{(E_a - E_b)/kT} (which is always between 0 and 1) and
if it is less than the move is still accepted.  Running back and forth
between a and b with this algorithm one can show that in time they will
be relatively populated by precisely the ratio required by the boltzmann
weight, and ditto as more and more states are opened up as candidates
for moves.

Heat Bath, which actually generates a new configuration b (still "close"
to configuration a) by directly selecting it from a Boltzmann
distribution in a sub-ensemble (sorry about the big words).  This is
possible only if one can invert the known probability distribution for
the entity(s) being changed while leaving the rest of the state alone.
This in turn is only possible for certain models -- the inverse function
is frequently a nasty hypergeometric thingie that can only be evaluated
numerically so one might as well use Metropolis instead.

Both of these use a "local" move -- they typically act on whatever is at
a single point in a lattice while holding the rest of the lattice
frozen.  In my work, for example, I'd alter the state of a single spin
at a lattice vertex in the field of its neighbors, which remain fixed.
This causes the change to be "small", which makes it a lot more likely
that one stays "near" the equilibrium region.

Cluster Methods are a relatively new Markov Process that alters an
entire block of objects (e.g. spins) at once.  They work by applying the
Metropolisish accept-reject decision not to just one object, but to a
cluster.  In a nutshell, a random spin is moved into a new state by
means of an operator (perhaps rotating the spin through some fixed
angle.  A spin next to this spin is selected.  The same operator is
applied to it, and the energy change with only those spins on its
>>boundary<< evaluated (and a random number compared to a Boltzmann
factor).  If the RNG is less, the spin is added to the cluster.  This
process is repeated for >>all spins connected to the cluster by an
interaction bond<< until the "reject" decision terminates the process.

This process selects a whole cluster of spins and moves them identically
in just such a way that the new energy of the whole system has changed
according to the Boltzmann weight.  Note that the "bond" energies inside
the cluster and outside the cluster are unchanged -- the only thing that
changes is the energy of the surface layer between the cluster and the
outside, which was thermalized in detail by the accept-reject procedure.

In all cases, when one has iterated long enough that one believes (or
rather can show) that "equilibrium" has been achieved, one takes one or
more steps in the Markov Chain, evaluates all the quantities that one
wishes to thermodynamically average and adds them (and possibly their
higher moments) to the running sum(s) required to generate the averages
and generally a higher moment or two.

This now leads us to an important propertly of this sort of Markov Chain
and statistics.  As noted repeatedly above, the "moves" selected are
deliberately small ones -- they typically return a new configuration
that is very "close" to the old one.  So close, in fact, that as a
general rule it is NOT statistically independent -- its autocorrelation
with the previous state is very high.  This is Bad.  The (in my opinion
anyway) principle theorem of statistics (the Central Limit Theorem, from
which averages and variances and the like obtain their practical
meaning) requires independent, identically distributed samples from the
beginning, and the Markov Chain produces samples that are not
independent.

However, the further one advances from the initial state, the "more
independent" they get, in the sense that the average autocorrelation
either exponentially vanishes or exponentially approaches its limiting
value, depending on the underlying state of thermal order, as a function
of the number of steps.  The different methods above have VERY different
autocorrelation properties.  In fact, the autocorrelation timescales
can even scale differently with lattice size, especially near a
"critical" temperature.

This leads us (at last) to define the terms requested above and indicate
why and how they are used.

If one views the weighted transitions that are used in the Markov Chain
as being motivated by physics (for example, a random quantum transition
of a spin's state) then of course the location of the transition will be
random.  One might then be tempted to use a random site selection rule
in the Monte Carlo model of the physical process.

If one is measuring autocorrelation relaxation properties, this is a
good idea.  As previously noted, a random site selection will typically
>>miss<< a large fraction of the N lattice sites in N random selections
with replacement.  The missed sites don't have their state changed at
all, and of course contribute strongly to the autocorrelation.  The
autocorrelation times will thus be very different (and far, far longer)
for a random site selection than from almost any other alternative.

On the other hand, if one is NOT concerned with time at all, but just
wants the best possible average in the shortest possible time, it is a
very poor way to proceed.  The Markov Process doesn't care at all if you
select sites randomly or not -- it will inexorably move you toward and
then sample equilibrium provided only that the method of generating new
states doesn't leave any part of the phase space inaccessible (in the
sense of being disconnected -- cannot get there from here in any
possible set of moves).

SO, an alternative is to use a "typewriter" or "left-right" site
selection method -- simply go in loop order through the lattice and
change the state of each spin (or other object) as its index comes up.
In N small moves, every spin is in a new state if heat bath is used.  If
Metropolis is used, unfortunately accept-reject methods have the same
problem that random site selection produced -- a lot of moves were
presumably rejected so some fraction of the spins are unchanged and the
autocorrelation decay thereby proceeds undesirably slowly.

A typewriter heat bath isn't bad at bond thermalization.  Every bond
energy gets changed >>twice<< per "sweep" of N moves.  However,
structures in the lattice still change relatively slowly, which is what
motivates cluster methods that rearrange whole clusters at once.
Cluster methods, however, only thermalize bonds on the cluster surface,
leaving one with a surface to volume problem that slows them down.  The
best thing to do is typically mix cluster methods with a full
local-lattice sweep, ideally heat bath if possible.

A shuffled heat bath is random site selection without replacement.  It
is basically like shuffling the sites like a deck of cards and then
going through the deck one after another, putting the sites completed
into the "discard pile" and NOT back into the deck.  It's a lot more
work, it doesn't give you a valid autocorrelation time, and the
autocorrelation time it DOES give you is very, very close to the
typewriter time.  Ergo, it is a waste of time (generally speaking) to
shuffle, although there may be an exception somewhere that I haven't yet
encountered.

A "checkerboard" method is an improvement on typewriter designed to make
vector (and sometimes cache) operations work better.  The energy of the
ith site typically depends on the state of its nearest neighbors.  On a
lattice, they might be the dark squares surrounding a selected light
square.  If one evaluates the "field" on the light sites due to the dark
sites in one loop pass, one can use the results in a single pass to
update all the light sites (leaving the dark ones untouched).  One then
does the dark sites.  Indeed, one can usually do the site update and the
field update at the same time (with diffs) and just rip through dark
light dark light... hence the term checkerboard.

Clearly both typewriter and checkerboard have considerable locality,
depending on the dimensionality and size of your lattice.  Equally
clearly, the lattice can be blocked and reordered in various ways to
achieve all sorts of superlinear speedups at e.g. cache boundaries, if
it weren't for the fact that it is so damn difficult to write portable
code that is that smart.  Finally, lattices can typically be split
across nodes with IPC's (for local interactions) that scale like surface
to volume.  Put all that together, and lattice Monte Carlo becomes an
interesting parallel computation problem indeed.

My one wish at this point would be for the ATLAS project to spawn a
kernel module that does, once and for all, certain benchmark
measurements (basically the ones used in ATLAS to tune things up, plus
others that might suggest themselves) and publishes them in proc.  One
could then write some simple systems calls in a library to return them
and (re)use the values portably in code.  It is, after all, rather silly
to have to run the entire ATLAS autotuning suite to build the libraries
-- the key parameters should be evaluated separately and used as just
that -- parameters -- that can be used elsewhere to tune similar things
up.  IMHO, of course.  I hink that ATLAS might end up being the most
important single performance enhancing concept to hit computing in the
last four or five years and would truly love to see it extended and
generalized.

 I hope this helps.  I know that it is long, but BELIEVE ME -- it's
short.

    rgb

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu