OT: computationally hard or not (algorithm question)
Robert G. Brown
rgb at phy.duke.edu
Fri Mar 23 11:36:23 PST 2001
On Fri, 23 Mar 2001, Frank Joerdens wrote:
> Consider an n-dimensional array containing something on the order of
> 10^4 (i.e. tens of thousands) elements, where each element may assume
> an integer value between 1 and 12. For the sake of the argument (it's
> easier to visualize; actually, n would be something on the order of a
> dozen), assume that n is 3, so you're basically looking at a box-shaped
> cloud of dots in 3-D space, where each dot's position is defined by its
> three-dimensional Cartesian coordinates, which are stored in the array.
>
> Now, take any one of those dots, search for the 10 (some figure on this
> order of magnitude) dots which are closest to it, and order those by
> proximity to the origin of the search. This sounds pretty hard,
> computationally. Not for 3 dimensions, since there the number of
> possible positions is only 12^3 = 1728, but for 12 it's 12^12 =
> 8916100448256. I guess you'd have to find some efficient closest-pair
> algorithm that works for n dimensions (my algorithm book only has one
> that works for 2), find the closest pair, remove it from the array,
> then find the next one, etc.
>
> Does anyone have an idea, or a pointer to some reading matter about it,
> as to how computationally expensive that would be? Any guess as to what
> kind of hardware you'd have to throw at it (I could settle for less
> dimensions if it turns out that I can't afford the hardware) when I want
> the result set within a couple of seconds at the _very_ most?
>
> Does anyone have a link to an article about the best algorithm for this
> kind of problem?
Not per se, but this problem (or problems that significantly overlap
with it) arises fairly often in both physics and multivariate
statistics. I'm not absolutely certain that I understand
your problem. As you describe it, it doesn't sound all that
computationally complex. However, it does sound very close in structure
to problems that are complex. There are two generically different ways
to approach the solution, depending on which category it is in.
You indicate that you intend to pick a point and then look for its ten
nearest neighbors (according to some metric you haven't mentioned, so
I'll assume a flat Euclidean metric). If I understand this correctly
(and there aren't any complications that kick the problem into the
"complex" category) a solution might look like:
  typedef struct {
    double coord[12];        /* coordinates of one element */
  } element;

  element elements[10000];   /* the ~10^4 elements */
  double distance[10000];    /* distance of each element from the chosen point */
then of course it is simple to pick an element (or a point), evaluate
the distance of each element in elements from that element or point
(storing the result in e.g. distance[i]), and sort the indices of
distance (not the list itself). This scales no worse than N plus the
scaling of the sort algorithm used, and the scaling has nothing to do
with the number of possible positions in the space. Indeed, each
element can contain real-number coordinates (hence double coord[12])
instead of one of 12 possible integer values with some integer metric;
that won't make a lot of difference in the time required for the
computation, and no difference at all in the scaling. A beowulf is
likely not needed unless you plan to do this a LOT.
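For concreteness, here is a minimal sketch of that brute-force pass (my
own illustration, not library code; the squared Euclidean metric and the
sort-indices-with-qsort trick are just one reasonable choice, and the
declarations above are repeated with symbolic sizes):

  #include <stdlib.h>

  #define NELEM 10000
  #define NDIM  12
  #define NNEAR 10

  typedef struct {
    double coord[NDIM];
  } element;

  element elements[NELEM];
  double  distance[NELEM];
  int     idx[NELEM];

  /* qsort comparator: order indices by the distances they point at */
  static int by_distance(const void *a, const void *b)
  {
    double da = distance[*(const int *) a];
    double db = distance[*(const int *) b];
    return (da > db) - (da < db);
  }

  /* fill out[] with the NNEAR nearest neighbors of elements[which] */
  void nearest(int which, int out[NNEAR])
  {
    int i, j;
    for (i = 0; i < NELEM; i++) {
      double d2 = 0.0;
      for (j = 0; j < NDIM; j++) {
        double dx = elements[i].coord[j] - elements[which].coord[j];
        d2 += dx * dx;   /* squared distance orders the same as distance */
      }
      distance[i] = d2;
      idx[i] = i;
    }
    qsort(idx, NELEM, sizeof(int), by_distance);
    for (i = 0; i < NNEAR; i++)
      out[i] = idx[i + 1];   /* idx[0] is the query point itself */
  }

A single call costs 10^4 distance evaluations plus an O(N log N) sort
-- trivial work for a single workstation.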
Alternatively, the problem may be something where you have to store the
elements in an array that maps to the spatial coordinates -- you have a
Space array addressable as
Space[a][b][c][d][e][f][g][h][i][j][k][l], where each array element
contains either 0 or one of 12 values. You then have to pick a point (a
nontrivial operation in and of itself: if a-l can each run from 1-12
and there are only 10^4 slots that aren't 0, you'll have to pick random
coordinates a lot of times to find one that doesn't contain 0). Since
this array is sparse, the sane thing would be to invert it as described
above; however, one COULD always e.g. search concentric 12-cubes around
the point until one finds ten entries. This would scale terribly -- the
space is only about 0.0000001% occupied (10^4 points in 12^12, or
roughly 8.9 x 10^12, sites) -- so you'd have to search rather large
cubes around the point before you were at all "likely" to find 10 hits,
unless the elements are spatially bunched in their distribution.
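Just to make that scaling concrete, here is a sketch of the
concentric-cube search in the 3-D toy version (a hypothetical
Space[12][12][12]; the 12-dimensional version is the same loop nest
twelve levels deep, over a table far too large to even allocate):

  #define L 12               /* sites run 0..11 in each dimension */
  int Space[L][L][L];        /* 0 = empty, 1..12 = occupied */

  /* grow a cube of half-width r around (x,y,z) until it holds at
     least `wanted` occupied sites; returns the radius needed.
     (For simplicity this rescans the whole cube each pass; a real
     version would scan only the newly added shell.) */
  int cube_search(int x, int y, int z, int wanted)
  {
    int r, i, j, k, found;
    for (r = 1; r < L; r++) {
      found = 0;
      for (i = x - r; i <= x + r; i++)
        for (j = y - r; j <= y + r; j++)
          for (k = z - r; k <= z + r; k++) {
            if (i < 0 || i >= L || j < 0 || j >= L ||
                k < 0 || k >= L)
              continue;      /* off the edge of the box */
            if ((i != x || j != y || k != z) && Space[i][j][k])
              found++;
          }
      if (found >= wanted)
        return r;
    }
    return -1;               /* not enough occupied sites in the box */
  }

Each pass probes (2r+1)^3 sites; in n dimensions that becomes (2r+1)^n,
and with occupancy around 10^-9 you'd expect to need a cube holding
roughly 10^10 sites before ten hits turn up.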
There are a number of physics and statistics problems that lie between
these two extremes -- the dimensionality of the space is high (perhaps
much higher than 12) but one cannot simply write down a way of
manipulating the elements as a simple vector. In that event, methods of
searching for optima in high-dimensional spaces come into play --
conjugate gradient methods, genetic optimization methods, or simulated
annealing methods. A classic example is the travelling salesman
problem, where there are e.g. 12 cities that must be visited in some
order and the goal is to minimize a cost function associated with each
pairwise connection between cities. The space of possible solutions
scales factorially with the number of cities (for 12 cities there are
already 11!/2, nearly twenty million, distinct closed tours), and
finding a >>good<< solution isn't easy or particularly likely with
random search methods, let alone finding the >>best<< solution.
GAs or simulated annealing both work much better; gradient methods are
likely to be irrelevant when the cost function has no continuous,
differentiable structure to exploit. In physics, a related problem is
e.g. the spin glass, where one seeks a configuration of spins that
minimizes a free energy in which the random "cost function" is the sign
of the coupling between nearest-neighbor spins on the lattice.
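For flavor, here is a heavily simplified simulated-annealing move for
the toy 12-city tour (my sketch; the cost matrix and the annealing
schedule are assumed to be supplied, and a serious implementation needs
far more care with both):

  #include <stdlib.h>
  #include <math.h>

  #define NCITY 12
  double cost[NCITY][NCITY];   /* pairwise connection costs (assumed filled in) */
  int    tour[NCITY];          /* current visiting order, a permutation of 0..11 */

  double tour_cost(const int t[NCITY])
  {
    double c = 0.0;
    int i;
    for (i = 0; i < NCITY; i++)
      c += cost[t[i]][t[(i + 1) % NCITY]];   /* closed tour */
    return c;
  }

  /* one Metropolis step at temperature T: swap two random cities;
     keep the swap if it helps, else keep it with probability
     exp(-(increase in cost)/T) */
  void anneal_step(double T)
  {
    int i = rand() % NCITY, j = rand() % NCITY, tmp;
    double before = tour_cost(tour), after;
    tmp = tour[i]; tour[i] = tour[j]; tour[j] = tmp;
    after = tour_cost(tour);
    if (after > before &&
        (double) rand() / RAND_MAX >= exp((before - after) / T)) {
      tmp = tour[i]; tour[i] = tour[j]; tour[j] = tmp;   /* reject: undo */
    }
  }

Cool T slowly over many such steps and the tour settles into a low-cost,
though not provably optimal, configuration -- which is exactly the
"acceptably good solution" caveat below.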
Both of these problems are known to be "difficult" in that finding
methods that scale well with the number of spins or cities is not easy,
and both of them are >>very<< difficult from the point of view of being
able to find "the" best solution instead of an "acceptably good
solution" (solution methods are almost invariably stochastic, so the
best one can give is a probability that a given solution is the best one
unless an exhaustive search is completed).
It isn't completely clear that your problem is in this latter category
of computational complexity, but if it is, you'll need to read a whole
lot about search and optimization methodology for high-dimensional
spaces before proceeding.
Hope this helps you begin to get a grip on the problem. I may be way
off base in my understanding of what you want to do -- in that event,
please describe the problem in more detail.
rgb
--
Robert G. Brown http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525  email: rgb at phy.duke.edu