BLAST for beowulf

Mon Apr 23 06:44:20 PDT 2001

There is another commercial version of BLAST that was not mentioned:
TurboBLAST from TurboGenomics. TurboBLAST addresses many of the objectives
outlined by Christopher.

1) I want to run a lot of BLAST queries in batches.
TurboBLAST performs batch queuing, so there is no need for LSF or PBS.

2) I want more speed on a single BLAST query.
TurboBLAST partitions both the input sequences (i.e., the queries) and
the databases to overcome memory limitations.

3) I have a BIG DNA database to search through.
Same as above: the database is partitioned to overcome memory limitations.

4) I want to set up a web-interface BLAST service on a cluster for
users.

TurboBLAST comes with a web interface for BLAST similar to the NCBI BLAST
interface at http://www.ncbi.nlm.nih.gov/blast/

- Ed

_____________________________________
Ed Pillsbury
TurboGenomics, Inc.
265 Church Street
New Haven, CT  06510
www.turbogenomics.com

----- Original Message -----
From: "Christopher Hogue" <hogue at mshri.on.ca>
To: "gregory j pryzby" <greg at pryzby.org>
Cc: <beowulf at beowulf.org>
Sent: Thursday, April 19, 2001 8:44 PM
Subject: Re: BLAST or wu-blast for beowulf?

> Hi folks
>
> Sorry that I haven't found time to answer this before; I have been very
> busy setting up our new company here in Toronto.
>
> The short answer is that so far the only clustered implementations
> available are commercial (www.computefarm.com or www.sgi.com).
> Otherwise, you can run BLAST with PBS and script it yourself, or set up
> the www-based CGIs for BLAST and run them behind a load balancer.
>
>
> The long "archive-quality" answer follows...
>
> BLAST and WU-BLAST are bioinformatics applications that compare protein
> or DNA sequences against databases of DNA or protein sequences to find
> similarities.  The original programs are highly optimized for
> multiprocessor machines such as the Sun and SGI boxes on which they
> were originally developed.
>
> The BLAST executables (original, non-clustered versions) are at
> ftp://ncbi.nlm.nih.gov/blast
> and WU-BLAST is at
> http://blast.wustl.edu/
>
>
> We call a BLAST job a "query": one sequence being compared to one
> database.
>
> There are several issues with running BLAST on a cluster, and several
> different implementation objectives. The answer is that it depends on
> what you want clustered BLAST to do!  The objectives vary quite a bit
> and require different implementations.
>
> Here are some examples of what your objectives might be:
> 1) I want to run a lot of BLAST queries in batches.
> 2) I want more speed on a single BLAST query.
> 3) I have a BIG DNA database to search through.
> 4) I want to set up a web-interface BLAST service on a cluster for
> users.
>
> In all cases, the implementation also needs scripts to do the daily
> updating of databases stored on the local node hard disks.  Figure on
> doing some work here; Perl helps.
>
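A minimal sketch of the node-update step described above, written in Python rather than Perl. It assumes the fresh master copy is reachable as an ordinary path (for example over NFS) instead of being fetched by ftp. The function names and the ".part" suffix are hypothetical, not part of any BLAST distribution.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def refresh_local_copy(master_path, local_path):
    """Copy master_path over local_path only when the contents differ.

    Returns True if a copy was made, False if the local file was current.
    """
    if os.path.exists(local_path) and (
        file_digest(master_path) == file_digest(local_path)
    ):
        return False
    # Copy under a temporary name (hypothetical ".part" suffix), then
    # rename, so a running search never sees a half-written file.
    tmp_path = local_path + ".part"
    shutil.copyfile(master_path, tmp_path)
    os.replace(tmp_path, local_path)
    return True


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    master = os.path.join(workdir, "master.fa")
    local = os.path.join(workdir, "node_copy.fa")
    with open(master, "w") as fh:
        fh.write(">seq1\nACGT\n")
    print(refresh_local_copy(master, local))  # first run copies the file
    print(refresh_local_copy(master, local))  # second run finds it current
```

Run from cron on each node, something along these lines would copy only the files that actually changed. Hashing every file daily is itself I/O heavy for large databases, so comparing sizes and timestamps first may be a better trade-off.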
> I address each of these situations in turn:
>
> 1)  I want to run a lot of BLAST queries.
>
> Then you want a compute farm approach.  Many people use load sharing
> software like LSF or PBS to execute BLAST on compute farms. You will
> also need to make scripts to ftp download and update the databases on
> all the nodes as a regular process or a cron job.
>
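For the compute-farm approach, one common preparatory step is to split a multi-sequence FASTA query file into batches, one per job submitted to the queue. The sketch below shows only that splitting step. The function names are hypothetical, and how each batch is submitted depends on your LSF or PBS setup.

```python
def read_fasta_records(text):
    """Split FASTA text into records; each starts with a '>' header line."""
    records = []
    current = []
    for line in text.splitlines():
        if line.startswith(">"):
            if current:
                records.append("\n".join(current) + "\n")
            current = [line]
        elif line.strip() and current:
            # Sequence line belonging to the record being collected.
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def make_batches(records, n_batches):
    """Deal records round-robin into n_batches lists (some may be empty)."""
    if n_batches < 1:
        raise ValueError("n_batches must be at least 1")
    batches = [[] for _ in range(n_batches)]
    for index, record in enumerate(records):
        batches[index % n_batches].append(record)
    return batches
```

Each batch can then be written to its own file and handed to the scheduler as a separate job. Round-robin dealing is the simplest choice; if query lengths vary widely, balancing batches by total sequence length may even out run times better.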
> 2)  I want more speed on a single BLAST query.
>
> BLAST becomes I/O bound very quickly on an SMP machine, and doesn't
> really scale that well on a cluster for a single query.  It is already
> multithreaded. Amdahl's law gets you very quickly in BLAST if you try
> interprocess communication as a model for speeding it up, so forget it.
> So if you want speed, add memory, faster CPUs or more of them, or chunk
> the database into pieces (see 3 below).  I suggest running multithreaded
> BLAST on dual-CPU nodes with sufficient disk and memory to store the
> databases.  Remember to use the processor-number argument too; BLAST
> needs to be told how many CPUs to run on.
>
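The post does not name the processor-number argument. In the legacy NCBI blastall program it is -a, and other BLAST versions may spell it differently, so check the documentation for yours. Assuming blastall, a command line for a dual-CPU node could be assembled as in the sketch below. The function name and the file and database names in the usage note are hypothetical.

```python
import os


def blastall_command(program, database, query_file, output_file, cpus=None):
    """Build an argument list for the legacy NCBI blastall binary.

    Assumed flags: -p program, -d database, -i query file,
    -o output file, -a number of processors.
    """
    if cpus is None:
        # Fall back to every CPU the operating system reports.
        cpus = os.cpu_count() or 1
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    return [
        "blastall",
        "-p", program,
        "-d", database,
        "-i", query_file,
        "-o", output_file,
        "-a", str(cpus),
    ]
```

For example, blastall_command("blastn", "nt", "query.fa", "query.out", cpus=2) gives a list that can be passed to subprocess.run on a node where blastall and the database are installed.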
> 3) I have a BIG DNA database to search through and must partition it.
>
> People who use BLAST on protein databases have smaller memory
> requirements than those using BLAST on DNA databases.  The DNA databases
> are much larger, and in commercial companies can be up to several tens
> of gigabytes. Companies often set up SMP machines with lots of RAM as BLAST
> servers, and they are typically not Linux boxes.
>
> Databases that don't fit in memory often cause the computers to thrash,
> especially on multiprocessor machines.  For example, a dual-CPU node
> with 128Gb RAM with two processes running will thrash horribly on a
> large DNA database as each thread competes to load the database chunk it
> is working on into the same block of memory.
>
> BLAST uses memory-mapped I/O, so that multiple instances can use the
> same data in memory, and it works best when the whole database fits in
> memory and multiple processes can have at it.
>
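To illustrate the memory-mapped I/O idea in general terms (this is not BLAST's own code), the sketch below maps a file read-only and scans it in place. When several processes map the same file this way, the operating system can back all of the mappings with the same cached pages instead of giving each process a private copy. The function names and the demo data are hypothetical.

```python
import mmap
import os
import tempfile


def count_sequences(path):
    """Count '>' bytes (FASTA header markers) through a read-only mmap."""
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; the file must not be empty.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
            count = 0
            pos = view.find(b">")
            while pos != -1:
                count += 1
                pos = view.find(b">", pos + 1)
            return count


def write_demo_database(directory):
    """Write a tiny FASTA file (made-up data) and return its path."""
    path = os.path.join(directory, "demo.fa")
    with open(path, "wb") as fh:
        fh.write(b">seq1\nACGTACGT\n>seq2\nGGCCTTAA\n>seq3\nAC\n")
    return path
```

The scan touches pages on demand, so nothing is read into a separate buffer first. This is also why a database larger than physical memory hurts: pages are evicted and re-read as the scan proceeds.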
> Blackstone Computing (www.computefarm.com) makes a clustered commercial
> version of BLAST that apparently operates using a redeployment of
> memory-mapped I/O.  When it needs a piece of a database, it seems to
> broadcast to the cluster that it is looking for that file, and it grabs
> any copy of the file already in memory on another node through a
> socket.  So it does a memory-to-memory transfer rather than a
> disk-to-memory transfer.  I have not tried this implementation.  It may
> also require some heavy scripting to break up the BLAST databases to
> match your cluster node size and memory.  Figure on doing this on a
> daily update cycle.
>
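Breaking a database up to match node memory can be sketched as a greedy packing of FASTA records into chunks under a byte budget. This is only an illustration, with a hypothetical function name. It measures raw FASTA text, which is not the same as the in-memory size of a formatted BLAST database, so the budget would need tuning, and each chunk would still have to be formatted as a database afterwards.

```python
def partition_by_size(records, max_bytes):
    """Greedily pack FASTA records into chunks of at most max_bytes each.

    A single record larger than max_bytes gets a chunk of its own.
    """
    if max_bytes < 1:
        raise ValueError("max_bytes must be at least 1")
    chunks = []
    current = []
    current_size = 0
    for record in records:
        size = len(record.encode("ascii"))
        # Start a new chunk when this record would overflow the budget.
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current = []
            current_size = 0
        current.append(record)
        current_size += size
    if current:
        chunks.append(current)
    return chunks
```

Records stay in their original order, so a daily rebuild tends to produce similar chunks from one day to the next as long as new sequences are appended at the end.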
> Anyhow, the memory-mapped I/O trick is an interesting one that could, I
> think, be implemented somewhere at the Linux kernel level as a
> general-purpose cluster utility.
>
> 4) I want to set up a web-interface BLAST server.
>
> This is a common desire, but is not really a cluster issue.  A good
> single-CPU machine can do this nicely for a few casual users; again, put
> enough RAM in it.
> Look here for precompiled executables for the CGI versions of BLAST.
> ftp://ncbi.nlm.nih.gov/blast/server/
> They have executable CGIs for Linux, Tru64, SGI and Solaris.
> If you set these up with a load balancer on several nodes, you may have
> what you need to serve more users.
>
>
> Christopher Hogue, Ph.D.
> CIO MDS Proteomics
> http://www.mdsproteomics.com
>
> On leave from the Samuel Lunenfeld Research Inst. Mt. Sinai, Toronto.
> http://bioinfo.mshri.on.ca
>
>
> gregory j pryzby wrote:
> >
> > I am looking for information (w/o much success) to see if there is a
> > version of BLAST that will run on a beowulf cluster.
> >
> > --
> > greg pryzby                      greg at pryzby dot org
> > ach tee tee pee colon slash slash pryzby dot org slash
> > fingerprint: 8A1A DB90 869F 5DD1 D6E9 EEB6 C156 6B04 849F A86F
> >
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
>