OT: computationally hard or not (algorithm question)

Mark Hahn hahn at coffee.psychology.mcmaster.ca
Fri Mar 23 15:00:07 PST 2001


> Consider an n-dimensional array containing something on the order of
> 10^4 (i.e. thens of thousands) elements, where each number may assume an

10K elements is a fairly small number.  especially when the space
is 12^12 (density is ~ 1/900K).

> Now, take any one of those dots, search for the 10 (some figure on this
> order of magnitude) dots which are closest to it, and order those by

why 10?

> 8916100448256. I guess you'd have to find some efficient shortest-pair
> algorithm that works for n dimensions (my algorithm book only has one
> that works for 2), find the shortest pair, remove it from the array,
> then find the next value etc..

do you really want to extract pairs like this?  it sounds a bit like
a dendrogram (from clustering).

the obvious place to look is any algorithms book (kD trees), and 
possibly also the clustering literature.  sorting the dimensions
based on their span or variance might be a smart heuristic, too.

I've been thinking of something similar: clustering ~25600-dimensional data.
the data are actually 400-sample timecourses recorded from 64 scalp
electrodes, and generally we'd have a few thousand of them to cluster.
(actually just rejecting outliers would probably be of immediate practical
interest...)

there's also a large body of literature on analyzing similarity in
data with many dimensions, each of which is 26 deep, and in analyzing
data with many, many dimensions, each of which is 4 deep.  
(text and genetics if you didn't guess ;)

> dimensions if it turns out that I can't afford the hardware) when I want
> the result set within a couple of seconds at the _very_ most?

the enclosed program run in around 7 seconds on my cheap duron/600,
and is a fairly obvious implementation of what you described.
it's also O(N^2) in the number of elements, but linear in the 
length of each string.  the next obvious step would be to use a 
kD tree, or something a little smarter than the quadratic rescanning
of the array.

regards, mark hahn.
-------------- next part --------------
#include <stdlib.h>
#include <myio.H>
#include <math.h>
#include <vector.H>

typedef char atom;
static inline double sq(atom a) { return double(a)*double(a); }

const unsigned itemLength = 12;

class Item {
    atom *data;
  public:
    Item() {
	data = new atom[itemLength];
	for (unsigned i=0; i<itemLength; i++)
	    data[i] = (atom) (12 * drand48());
    }
    friend double distance(const Item &a, const Item &b) {
	double sum = 0;
	for (unsigned i=0; i<itemLength; i++) {
	    atom d = a.data[i] - b.data[i];
	    sum += sq(d);
	}
	return sum;
    }
    friend ostream operator<<(ostream &os, const Item &item) {
	for (unsigned i=0; i<itemLength; i++)
	    os << char('a' + item.data[i]);
	return os;
    }
};

int
main() {
    unsigned itemCount = 10000;
    Vector<Item> items(itemCount);

    for (unsigned a=0; a<items.n; a+=2) {
	unsigned bestItem;
	double bestScore = DBL_MAX;
	for (unsigned b=a+1; b<itemCount; b++) {
	    double d = distance(items[a], items[b]);
	    if (d < bestScore) {
		bestScore = d;
		bestItem = b;
	    }
	}
//	cout << items[a] << ' ' << items[bestItem]
//	     << ": score is " << bestScore << endl; 

	// swap bestItem with last item.
	Item t = items[a+1];
	items[a+1] = items[bestItem];
	items[bestItem] = t;
    }

    return 0;
}


More information about the Beowulf mailing list