[Beowulf] High Performance for Large Database

Wed Oct 27 12:52:06 PDT 2004

> I'm also very interested in just what sort of symbolic manipulation you
> are working on.

The numerical/symbolic mix behind my opinions comes from natural
language processing, mostly speech recognition. It involves a training
phase which uses a huge amount of recorded speech, iteratively
turned into estimated statistical distributions of phoneme sounds
(multivariate Gaussians with some 500,000 parameters, work for the
FPU), and a huge amount of text turned into dictionaries and grammar
rules (symbolic and maybe even SQL). This phase is not very
beowulfish: processes can work locally for minutes at a time.
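
A minimal sketch of why such training can run locally for minutes at a
time: each process accumulates sufficient statistics over its own share
of the speech frames, and only the small summed statistics need to be
combined at the end of an iteration. The class and function names here
(GaussStats, accumulate_chunk) are hypothetical, and the
diagonal-covariance form and per-frame weights are assumptions made for
illustration, not a description of the actual system.

```python
class GaussStats:
    """Sufficient statistics for one diagonal-covariance Gaussian (assumed form)."""

    def __init__(self, dim):
        self.count = 0.0
        self.sum_x = [0.0] * dim
        self.sum_xx = [0.0] * dim

    def add(self, frame, weight=1.0):
        """Accumulate one feature vector with an (assumed) per-frame weight."""
        self.count += weight
        for i, x in enumerate(frame):
            self.sum_x[i] += weight * x
            self.sum_xx[i] += weight * x * x

    def merge(self, other):
        """Fold in statistics gathered by another process."""
        self.count += other.count
        for i in range(len(self.sum_x)):
            self.sum_x[i] += other.sum_x[i]
            self.sum_xx[i] += other.sum_xx[i]

    def estimate(self):
        """Return (mean, variance) per dimension from the accumulated sums."""
        mean = [s / self.count for s in self.sum_x]
        var = [sq / self.count - m * m for sq, m in zip(self.sum_xx, mean)]
        return mean, var


def accumulate_chunk(frames, dim):
    """Local work on one chunk: nothing is exchanged until the result is returned."""
    stats = GaussStats(dim)
    for frame in frames:
        stats.add(frame)
    return stats
```

Merging is just addition of the sums, so the order in which per-chunk
results arrive does not matter.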

Then there is the recognition phase, when we match unknown utterances
against our models of sounds, pronunciation, dictionaries and
grammar. This is very beowulfish, as we need to estimate zillions of
partial hypotheses and compose them together somehow, likely in real
time, and we are happy to pass quick messages around and keep most
things in aggregated cluster RAM.

Training on huge speech data has very much the pattern just described
by Mark Hahn:

> depends.  for instance, it's not *that* uncommon to have DB's which
> see almost nothing but read-only queries (and updates, if they happen
> at all, can be batched during an off-time.)  that makes a parallel
> version quite easy

(though we do not have wav files in SQL :-) ) and we are very
interested in ways to divide our data into chunks that are cached on
the nodes' local hard disks and processed again and again (say 30
times during one iterative process, and we try many variants of this
process on the same data).
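
One simple way to get that reuse is to fix the chunk-to-node placement
once, so every pass finds the same chunks already sitting on the same
local disk. The sketch below is a hedged illustration only: it assumes
round-robin placement, uses hypothetical names (assign_chunks,
run_passes, process_chunk), and simulates the nodes sequentially in one
process rather than using any real cluster API.

```python
def assign_chunks(chunk_ids, n_nodes):
    """Round-robin placement; stable as long as chunk order and n_nodes are fixed."""
    placement = {node: [] for node in range(n_nodes)}
    for position, chunk in enumerate(chunk_ids):
        placement[position % n_nodes].append(chunk)
    return placement


def run_passes(placement, n_passes, process_chunk):
    """Repeat the same work n_passes times; each node touches only its own chunks."""
    results = []
    for pass_no in range(n_passes):
        per_pass = []
        for node, chunks in placement.items():
            for chunk in chunks:
                # process_chunk is a caller-supplied (hypothetical) callback
                per_pass.append(process_chunk(node, chunk, pass_no))
        results.append(per_pass)
    return results
```

With a placement computed once, something like
run_passes(placement, 30, process_chunk) visits every chunk on the same
node in each of the 30 passes, so only the first pass has to pay for
copying data to local disk.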

Of course we have just one cluster for both things, so it constantly
switches between being a beowulf and not being a beowulf :-)

Best Regards

Vaclav Hanzl