What could be the performance of my cluster
suraj_peri at yahoo.com
Sat Apr 13 03:21:52 PDT 2002
BLAST (Basic Local Alignment Search Tool) takes the
query sequence (either protein or DNA) and tries to
match small patches of it (let's say it breaks your
sequence into small pieces of 6 letters and then tries
to match them against the database index file).
Once the BLAST algorithm finds a small match, it tries to
extend the match between your query sequence and the
database sequence. If the match extends further, it computes
a higher score and reports it. If it doesn't, it reports a
low score, and low-scoring hits are not considered.
Thus, in my opinion, it does many calculations and
finally shows the scores (with a P-value).
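The seed-and-extend idea described above can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the real BLAST code: it uses exact 6-letter word matches, extends without gaps or a substitution matrix, scores a hit by its matched length, and computes no P-value. The names build_index, extend and search are hypothetical.

```python
# Toy seed-and-extend search; NOT the real BLAST implementation.
from collections import defaultdict

WORD_SIZE = 6  # the "6 letters" mentioned above; illustrative only


def build_index(database, word_size=WORD_SIZE):
    """Map every word of length word_size to its (sequence id, offset) positions."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - word_size + 1):
            index[seq[pos:pos + word_size]].append((seq_id, pos))
    return index


def extend(query, subject, q_pos, s_pos, word_size=WORD_SIZE):
    """Grow an exact seed match left and right while the letters keep matching."""
    left = 0
    while (q_pos - left > 0 and s_pos - left > 0
           and query[q_pos - left - 1] == subject[s_pos - left - 1]):
        left += 1
    right = word_size
    while (q_pos + right < len(query) and s_pos + right < len(subject)
           and query[q_pos + right] == subject[s_pos + right]):
        right += 1
    return left + right  # toy score: length of the ungapped exact match


def search(query, database, word_size=WORD_SIZE):
    """Return the best extension score per database sequence, highest first."""
    index = build_index(database, word_size)
    best = {}
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for seq_id, s_pos in index.get(word, []):
            score = extend(query, database[seq_id], q_pos, s_pos, word_size)
            best[seq_id] = max(best.get(seq_id, 0), score)
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Sequences that share no 6-letter word with the query never appear in the result, which is the sense in which low-scoring hits are dropped.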
Interestingly, BLAST is considered a local alignment
search tool because it matches bits of your
query sequence and then extends them for more matches.
In contrast, there is another algorithm called FASTA
(Fast alignment search tool), which is global (meaning
it takes big chunks of sequence and then tries to
thread them over the database).
So Bill Pearson (its creator) made a PVM version of FASTA,
and his students at Virginia are using it on a Beowulf
( You can access that at
In my case, the database would be ~80 GB (I hope to
use this much data over NFS).
I am planning to install this algorithm on every
node and then, using MPICH, have each node
access the whole database over NFS.
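To get a feel for this plan, here is a rough sketch of two things: splitting a batch of queries round-robin across ranks, and a back-of-envelope estimate of how long it takes every node to read the whole database once from a single NFS server. The function names, the node count and the server throughput are assumptions made for illustration; in a real MPICH run the rank and the number of ranks would come from MPI, which this sketch does not call.

```python
def assign_queries(queries, n_ranks):
    """Round-robin split: rank r gets queries r, r + n_ranks, r + 2*n_ranks, ..."""
    return [queries[rank::n_ranks] for rank in range(n_ranks)]


def nfs_read_hours(db_gbytes, n_nodes, server_mbytes_per_s):
    """Hours for every node to read the whole database once from one NFS server.

    Assumes the server's total throughput is shared by all nodes and ignores
    client-side caching, so treat the result as a rough estimate only.
    """
    total_mbytes = db_gbytes * 1000 * n_nodes  # 1 GB taken as 1000 MB
    return total_mbytes / server_mbytes_per_s / 3600


# Hypothetical numbers: 16 nodes, server sustaining 10 MB/s in total.
hours = nfs_read_hours(80, 16, 10)
```

With these assumed figures, one full pass over the 80 GB database by all 16 nodes takes about 35.6 hours, so the estimate is very sensitive to the server throughput and to how often each node has to re-read the data.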
I am new to this area, and I wonder whether these
ideas are practical. We will start configuring
our cluster some time in May.
--- Robert Depenbrock <robert at bay13.de> wrote:
> Greg Lindahl wrote:
> Hi Greg,
> > On Fri, Apr 12, 2002 at 11:15:52AM -0600, Craig Tierney wrote:
> > > Is the BLAST code something that spends lots
> > > of time trying doing lots of little
> > > or doing one big calculation? How important is
> > > the speed of access to the database? What is
> > > the memory footprint of the code when it runs
> > > on the DS20E?
> > It depends.
> > What BLAST does is compare a set of sequences against a big database of
> > sequences. The databases come in small, medium, and large (bigger than
> > 2 GByte) sizes; the sequences can either be a single sequence (imagine
> > a researcher looking up a single protein using a web interface) or a
> > large set of them. If it's a large set, the problem is embarrassingly
> > parallel.
> > The BLAST implementation used by most people isn't parallel. It can be
> > fairly easily parallelized to divide the big database up into pieces.
> > People build fairly different clusters to run BLAST depending on their
> > details. The guys at Celera Genomics didn't want to use a parallel
> > version, and their database is bigger than 2 GBytes, so they bought
> > Alphas. Most people have small enough databases to fit into 2 GBytes,
> > but search against 1 sequence at a time, so they can't afford to read
> > the entire database over NFS every time, and keep it on a local disk.
> Do you have some sample proteins and databases?
> I would like to test some machines I have available to mess around a
> little bit.
> (HP PA-Risc Series, SUN Sparc Fire, Itanium, Power
> I would like to build a little benchmark around these datasets.
> Robert Depenbrock
> nic-hdl RD-RIPE
> e-mail: robert at bay13.de
> Fingerprint: 1CEF 67DC 52D7 252A 3BCD 9BC4 2C0E AC87 6830 F5DD