BLAST for beowulf
Ed Pillsbury pillsbury at turbogenomics.com
Mon Apr 23 06:44:20 PDT 2001
There is another commercial version of BLAST that was not mentioned:
TurboBLAST from TurboGenomics. TurboBLAST addresses many of the objectives
outlined by Christopher.

1) I want to run a lot of BLAST queries in batches.
   TurboBLAST performs batch queuing, so there is no need for LSF or PBS.

2) I want more speed on a single BLAST query.
   TurboBLAST partitions input sequences (i.e. queries) and databases to
   overcome memory limitations.

3) I have a BIG DNA database to search through.
   Same as above.

4) I want to set up a web-interface BLAST service on a cluster for users.
   TurboBLAST comes with a web interface for BLAST similar to the NCBI BLAST
   interface at http://www.ncbi.nlm.nih.gov/blast/

- Ed
_____________________________________
Ed Pillsbury
TurboGenomics, Inc.
265 Church Street
New Haven, CT 06510
www.turbogenomics.com

----- Original Message -----
From: "Christopher Hogue" <hogue at mshri.on.ca>
To: "gregory j pryzby" <greg at pryzby.org>
Cc: <beowulf at beowulf.org>
Sent: Thursday, April 19, 2001 8:44 PM
Subject: Re: BLAST or wu-blast for beowulf?

> Hi folks
>
> Sorry that I haven't found time to answer this before; I've been very
> busy setting up our new company here in Toronto.
>
> The short answer is that so far there are only commercial
> implementations available (www.computefarm.com or www.sgi.com);
> otherwise you can run BLAST with PBS and script it yourself, or set up
> the www-based CGIs for BLAST and run them behind a load balancer.
>
>
> The long "archive-quality" answer follows...
>
> BLAST and WU-BLAST are bioinformatics applications that compare protein
> or DNA sequences against databases of DNA or protein sequences to find
> similarities. The original programs are highly optimized for
> multiprocessor machines like the Sun and SGI boxes on which they were
> originally developed.
>
> The BLAST executables (original, non-clustered versions) are at
> ftp://ncbi.nlm.nih.gov/blast
> and WU-BLAST is at
> http://blast.wustl.edu/
>
>
> When we refer to a BLAST job, we call it a "query": one sequence being
> compared to one database.
>
> There are several issues with running BLAST on a cluster, and different
> implementation objectives. The answer is that it depends on what you
> want clustered BLAST to do! The objectives vary quite a bit, and require
> different implementations.
>
> Here are some examples of what your objectives might be:
> 1) I want to run a lot of BLAST queries in batches.
> 2) I want more speed on a single BLAST query.
> 3) I have a BIG DNA database to search through.
> 4) I want to set up a web-interface BLAST service on a cluster for
> users.
>
> In all cases, the implementation also needs scripts to do the daily
> updating of databases stored on the local node hard disks. Figure on
> doing some work here; Perl helps.
>
> I address these situations in turn:
>
> 1) I want to run a lot of BLAST queries.
>
> Then you want a compute farm approach. Many people use load sharing
> software like LSF or PBS to execute BLAST on compute farms. You will
> also need to write scripts to download the databases by ftp and update
> them on all the nodes as a regular process or a cron job.
>
> 2) I want more speed on a single BLAST query.
>
> BLAST becomes I/O bound very quickly on an SMP machine, and doesn't
> really scale that well on a cluster for a single query. It is already
> multithreaded. Amdahl's law catches up with you very quickly in BLAST
> if you try interprocess communication as a model for speeding it up, so
> forget it. So if you want speed, add memory, faster CPUs or more of
> them, or chunk the database into pieces (see 3 below). I suggest running
> multithreaded BLAST on dual-CPU nodes with sufficient disk and memory to
> store the databases. Remember to use the processor number argument too;
> BLAST needs to be told how many CPUs to run on.
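
The Amdahl's law point in 2) can be made concrete with a little arithmetic. The sketch below is illustrative only: the 0.6 parallel fraction is an assumed figure, not a measurement of BLAST, and `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup under Amdahl's law: 1 / ((1 - p) + p / n)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # ASSUMPTION: 60% of the run time parallelizes; the rest (I/O, setup,
    # communication) stays serial. A made-up figure for illustration.
    p = 0.6
    for n in (1, 2, 4, 16, 64):
        print(f"{n:3d} CPUs -> speedup {amdahl_speedup(p, n):.2f}x")
```

With these assumed numbers the speedup can never exceed 1 / 0.4 = 2.5x, however many CPUs are added, which is the kind of ceiling the paragraph above warns about.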
>
> 3) I have a BIG DNA database to search through and must partition it.
>
> People who use BLAST on protein databases have smaller memory
> requirements than those using BLAST on DNA databases. The DNA databases
> are much larger, and in commercial companies can be up to several tens
> of gigabytes. Companies often set up SMP machines with lots of RAM as
> BLAST servers, and they are typically not Linux boxes.
>
> Databases that don't fit in memory often cause the computers to thrash,
> especially if you have multiprocessor machines running. For example, a
> dual-CPU node with 128 MB RAM with two processes running will thrash
> horribly on a large DNA database as each thread competes to load the
> database chunk it is working on into the same block of memory.
>
> BLAST uses memory-mapped I/O, so that multiple instances can use the
> same data in memory, and it works best when the whole database fits in
> memory and multiple processes can have at it.
>
> Blackstone Computing (www.computefarm.com) makes a clustered commercial
> version of BLAST that apparently operates using a redeployment of
> memory-mapped I/O. When looking for a piece of a database, it seems to
> broadcast to the cluster that it is looking for a file, and it grabs
> any copy of that file already in memory on another node through a
> socket. So it does a memory-to-memory transfer rather than a
> disk-to-memory transfer. I have not tried this implementation. It also
> may require some heavy scripting to break up the BLAST databases to
> match your cluster node size and memory. Figure on doing this on a
> daily update cycle.
>
> Anyhow, the memory-mapped I/O trick is an interesting one that could be
> implemented at the Linux kernel level somewhere, I think, as a
> general-purpose cluster utility.
>
> 4) I want to set up a web-interface BLAST server.
>
> This is a common desire, but is not really a cluster issue. A good
> single-CPU machine can do this nicely for a few casual users; again, put
> enough RAM in it.
> Look here for precompiled executables for the CGI versions of BLAST.
> ftp://ncbi.nlm.nih.gov/blast/server/
> They have executable CGIs for Linux, Tru64, SGI and Solaris.
> If you set these up with a load balancer on several nodes, you may have
> what you are looking for to serve more users.
>
>
> Christopher Hogue, Ph.D.
> CIO MDS Proteomics
> http://www.mdsproteomics.com
>
> On leave from the Samuel Lunenfeld Research Inst. Mt. Sinai, Toronto.
> http://bioinfo.mshri.on.ca
>
>
> gregory j pryzby wrote:
> >
> > I am looking for information (w/o much success) to see if there is a
> > version of BLAST that will run on a beowulf cluster.
> >
> > --
> > greg pryzby greg at pryzby dot org
> > ach tee tee pee colon slash slash pryzby dot org slash
> > fingerprint: 8A1A DB90 869F 5DD1 D6E9 EEB6 C156 6B04 849F A86F
> >
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org
> > To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
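
Points 2) and 3) of the quoted message both come down to chunking the database so that each piece fits in a node's memory. As a rough sketch of that scripting, assuming the database starts out as a plain FASTA file and that a byte budget per chunk is a reasonable stand-in for "fits in memory" (the names `read_records` and `partition_fasta` and the budget are hypothetical; each chunk would still need to be formatted as a BLAST database before searching):

```python
from typing import Iterable, Iterator, List


def read_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group FASTA lines into records; each record starts at a '>' header."""
    record: List[str] = []
    for line in lines:
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():  # skip blank lines
            record.append(line)
    if record:
        yield record


def partition_fasta(lines: Iterable[str], max_bytes: int) -> Iterator[List[str]]:
    """Yield chunks of FASTA lines whose total size stays within max_bytes.

    Records are never split. A single record larger than max_bytes
    still becomes a chunk of its own.
    """
    chunk: List[str] = []
    size = 0
    for record in read_records(lines):
        record_size = sum(len(line.encode()) for line in record)
        if chunk and size + record_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.extend(record)
        size += record_size
    if chunk:
        yield chunk


if __name__ == "__main__":
    # Tiny in-memory example; a real script would stream a file instead.
    demo = [">a", "ACGT", ">b", "ACGTACGT", ">c", "AC"]
    for i, part in enumerate(partition_fasta(demo, max_bytes=12)):
        print(f"chunk {i}: {part}")
```

Each chunk could then be written to its own file and placed on a different node, with the per-chunk results merged afterwards. Because BLAST E-values depend on the size of the database searched, scores from separate chunks are only comparable if the effective database size is set consistently across the runs.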
