What could be the performance of my cluster

Fri Apr 12 10:43:35 PDT 2002

On Fri, Apr 12, 2002 at 11:15:52AM -0600, Craig Tierney wrote:

> Is the BLAST code something that spends lots
> of time trying doing lots of little calculations,
> or doing one big calculation?  How important is
> the speed of access to the database?  What is
> the memory footprint of the code when it runs
> on the DS20E?

It depends.

What BLAST does is compare a set of query sequences against a big
database of sequences. The databases come in small, medium, and large
(bigger than 2 GByte) sizes; the query can be either a single sequence
(imagine a researcher looking up a single protein using a web interface)
or a large set of them. If it's a large set, the problem is
embarrassingly parallel, since each query can be searched independently
of the others.
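
To illustrate the large-set case, here is a rough sketch of dealing a
batch of queries out to independent workers. It is not BLAST code; the
function name and the query labels are made up for the example.

```python
def split_queries(queries, n_workers):
    """Deal queries round-robin into n_workers independent batches.

    Each batch can be searched on a different node with no communication
    between nodes, which is what makes the problem embarrassingly parallel.
    """
    return [queries[i::n_workers] for i in range(n_workers)]


# Hypothetical query labels, split across two workers.
batches = split_queries(["q1", "q2", "q3", "q4", "q5"], 2)
```

Each worker still needs access to the whole database, so this only
helps when there are many queries to spread around.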

The BLAST implementation used by most people isn't parallel. It can be
fairly easily parallelized by dividing the big database up into pieces
and searching each piece separately.
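
A minimal sketch of that idea, assuming a toy scoring function in place
of a real alignment: split the database into pieces, search each piece,
then merge the per-piece hits. All names here (toy_score,
split_database, and so on) are hypothetical, and none of this is how
any particular BLAST implementation does it.

```python
import heapq


def toy_score(query, subject):
    # Hypothetical stand-in for an alignment score:
    # count the positions where the two sequences agree.
    return sum(a == b for a, b in zip(query, subject))


def split_database(db, n_pieces):
    # Deal database sequences round-robin into n_pieces pieces.
    return [db[i::n_pieces] for i in range(n_pieces)]


def search_piece(query, piece, top_k):
    # Best top_k (score, sequence) hits within one piece.
    return heapq.nlargest(top_k, ((toy_score(query, s), s) for s in piece))


def search_partitioned(query, db, n_pieces, top_k):
    # In a cluster each piece would be searched on its own node;
    # here the pieces are searched one after another for simplicity.
    partial = []
    for piece in split_database(db, n_pieces):
        partial.extend(search_piece(query, piece, top_k))
    # Merge step: keep the overall best top_k hits.
    return heapq.nlargest(top_k, partial)
```

With a raw score like this one, the merged result matches a search of
the undivided database. A real split would also have to deal with any
statistics that depend on the total database size, which this sketch
ignores.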

People build fairly different clusters to run BLAST depending on the
details of their workload. The guys at Celera Genomics didn't want to
use a parallel version, and their database is bigger than 2 GBytes, so
they bought Alphas. Most people have databases small enough to fit into
2 GBytes, but search against 1 sequence at a time, so they can't afford
to read the entire database over NFS for every search, and keep it on a
local disk instead.

greg