OT: computationally hard or not (algorithm question)

Robert G. Brown rgb at phy.duke.edu
Sat Mar 24 14:59:05 PST 2001


On Sat, 24 Mar 2001, Frank Joerdens wrote:

> That sounds very interesting indeed! To spill the beans about what I'm
> up to (it's a little embarrassing, seeing that you supercomputing people
> are into fairly serious stuff mostly: protein sequences, physics, data
> mining etc.), it's an online community/game where I want to give users
> the option to find "similar" avatars to theirs. Similarity is a
> massively complex notion, psychologically, and ideally this would be a
> problem that you'd attack via some AI strategy (expert systems, or
> systems that "learn" similarity), but the process needs to be _fast_: If
> a user comes to the site, outfits her/his avatar with attributes and
> then hits the go button to look for potential buddies, something around
> a second would be acceptable. This is bearing in mind that there might
> be dozens of those queries running more or less simultaneously, or in
> quick succession. Hence my thinking that you'd define a set of
> properties according to which every attribute is rated. Consider
> clothing: Every item would get a rating according to coolness, elegance
> and sportiness (we played that through; this would be _way_ too narrow,
> you'd need quite a few more differentiated categories), which means
> you're looking at a 3D space where each dimension might be only 3
> values deep (2 = very, 1 = kind of, 0 = not at all). Actually, I am
> thinking now that I could do with a depth of only 3 or 4, but that I'd
> still need around a dozen dimensions (we haven't fixed the property set
> yet).
>
> Do you have a starting point from where to dive into this body of
> literature by any chance?

Similarity is my business.  Read about Parzen-Bayes classifiers -- this
isn't exactly what you want (because you don't have discrete
identifiable classes) but the notion of a classification/similarity
metric is developed there.  The next interesting idea to explore would
be using a neural network.  One can build a kind of network that
identifies a given individual out of a crowd, and then "score" the crowd
with the network to find near misses.  Expensive on a per-person basis,
but depending on how hard you want to work you might be able to
distribute the time.  This would also parallelize very nicely -- train
nets for each person in parallel on many hosts, coarsely aggregate them
by score, and perhaps identify a classification schema a posteriori.
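
To make the Parzen idea concrete, here is a minimal sketch (the struct
name, the dozen-dimension layout, and the bandwidth are my assumptions,
not your final property set): score a query avatar against a reference
set by summing Gaussian kernels centered on the reference vectors.  A
Parzen-Bayes classifier would compute one such score per class and pick
the largest (with equal priors).

#include <math.h>

#define NDIM 12          /* number of rating dimensions (assumed) */
#define BANDWIDTH 1.0    /* kernel width h; tune empirically      */

typedef struct { int feat[NDIM]; } avatar;

/* squared Euclidean distance in rating space */
static double dist2(const avatar *a, const avatar *b)
{
    double d2 = 0.0;
    int i;
    for (i = 0; i < NDIM; i++) {
        double d = (double)(a->feat[i] - b->feat[i]);
        d2 += d * d;
    }
    return d2;
}

/* Parzen-window score: higher means "more like the reference set" */
double parzen_score(const avatar *query, const avatar *ref, int nref)
{
    double s = 0.0;
    int i;
    for (i = 0; i < nref; i++)
        s += exp(-dist2(query, &ref[i]) / (2.0 * BANDWIDTH * BANDWIDTH));
    return s / (double)nref;
}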

The same kind of thing is VERY useful in e.g. web business (or any
business) for identifying potential customers "like" your best
customers.

> I don't read C too well but as far as I could make out, the program you
> enclosed (great!) implements Robert's first suggestion to
>
> -------------------------- begin quote --------------------------
> to pick an element (or a point), evaluate the distance of each element
> in the elements from the element or point (storing the result in e.g.
> distance[i]) and sort the indices of distance (not the list itself).
> -------------------------- end quote --------------------------
>
> I'll try that!

Read about Parzen-Bayes first.  This is still the way you'd want to
store the data; PB will just help you work out a way of creating a
"good" similarity metric.

   rgb

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu
