[Beowulf] RE: [Bioclusters] FPGAin bioinformatics clusters (again?)

Tue Jan 17 11:16:23 PST 2006

On Tue, 17 Jan 2006, Mike Davis wrote:

> Robert,
>
> I agree. We want to avoid havbing the OS creating the tree structure 
> necessary to deal with those file sized and avoid stating all of those files. 
> A  user created tree structure would save time, an indexed database would 
> possibly as well.
>
> Basically, each and every sequence needs to be compared and recompared as the 
> assembly runs. In the end, the goal is an assembled genome with no gaps.
>
> Better file structures and/or indexing could both be a big help.
>
> For everyone, these analyses run on a dedicated local filesystem on an SMP 
> Sun with either 32 or 96GB of RAM. The actual genome itself is only a GB at 
> most when assembled.

How big are the files from which you assemble (the individual units in
the "50,000 files" below)?  You've got an awful lot of RAM to buffer
things in, so a parallel readahead on a dual processor system could very
likely eliminate file-based delays altogether if there is a significant
amount of computation (running through all positions looking for base
pair matches?) per file.  At least you could probably get it to where
you were bw limited on the filesystem, which is the best you can ever
get, and eliminate the file seek/stat/open latency.

    rgb

>
>
> Mike Davis
>
>
>
> Robert G. Brown wrote:
>
>> On Mon, 16 Jan 2006, Mike Davis wrote:
>> 
>>> But BLAST is only a small part and argueably the easiest part of genomics 
>>> work. The advantages of parallelization and/or smp come into play when 
>>> attempting to assemble the genome. Phred/Phrap can do the work but starts 
>>> to slow even large machines when your talking 50k+ of sequences (which it 
>>> wants to be in one folder). A quiz for  the Unix geeks out there, what 
>>> happens when a folder has 50,000 files in it. Can you say SLOOOOOOOOOWWWW?
>> 
>> 
>> Right, but that's why God invented databases and tree structures and the
>> like.  The problem is in the software design.  One file with 50K
>> sequences in it will read in very quickly and only has to stat once.
>> 50K sequences in a file apiece will have to stat EACH file, period, and
>> stat is expensive, especially if you're using e.g. NIS or the like
>> naively that adds a network hit per new file reference to check up on
>> all the perms and groups and so on.  It also takes time just to build
>> the kernel's image of inodes if there are 50K of them in a single
>> directory -- it DOES have to be read a file with 50K entries just to
>> start the process of stat'ing them one at a time.  I suspect that this
>> exceeds the capacity of the kernel to smoothly cache the file data as
>> well.
>> 
>> A compromise is to organize the files into a tree (if the problem is a
>> search problem with some sort of ordinal or catagorical data
>> organization) so that there aren't so many inodes to parse at any given
>> level.  Whether or not that wins for you probably depends on the task's
>> organization -- if you can avoid some parts (ideally most!) of the tree
>> altogether it should be a big win.  Looking things up in a tree or a
>> hash is much, much more efficient than looking them up in a linear list.
>> 
>> Of course if the task is just reading in each file, one at a time, and
>> searching all sorts of things WITHIN the file, then tanstaffl -- if the
>> task is dominated by the actual time spent IN the file doing the
>> lookups, then you don't have a lot of room to speed it up no matter how
>> you organize it, although still one big sequential file will be somewhat
>> faster than many smaller sequential files.  If you were really
>> interested in minimizing time, you might be able to do a simple forking
>> off of threads to manage the lookup/stat/open of the next file "in
>> parallel with" the processing of the current file.
>> 
>> As in:
>>
>>   Stat/Open file A (costs perhaps ~1 ms latency)
>>   Read file A into memory (costs filesize/disk bw)
>>   Fork Stat/Open file B   AND   Process file A
>>   (issue int/block/return)      (work during block)
>> 
>> The idea here is that the stat/open thread is likely to issue an
>> interrupt to the disk and then block during the seek, which is
>> responsible for most of the time (although just reading the inode table
>> the first time will be a hit as well, we can hope that thereafter it
>> will be cached).  During the block process control is likely to be
>> returned to the work thread so that it can proceed in parallel with the
>> slow old disk.  This might help "hide" most of the latency and shift the
>> ratio of user to system time a bit further towards user, or might not --
>> never really tried it.
>> 
>> Which won't help at all if you don't have the source code, but hey,
>> that's what open vs closed source code is all about.  If you have the
>> source, you can fix boneheaded design decisions and optimize and improve
>> -- probably beyond even what I suggest here as I'm not a real computer
>> scientist and a real one could probably think of even better ways to
>> proceed.  If you don't have the source, and your only interface to those
>> that do is some sales guy who thinks inodes are a kind of show-off
>> intellectual, interrupts are rude at dinner, and blocks are things his
>> kid plays with at home, well...
>>
>>    rgb
>> 
>>> 
>>> Mike Davis
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Lukasz Salwinski wrote:
>>> 
>>>> Michael Will wrote:
>>>> 
>>>>> I have always been amazed at the promises of massivelyparallel. Now
>>>>> their
>>>>> technique is so good they don't even need the source code to
>>>>> parallelize?
>>>>> 
>>>>> ...but if I tell you how I would have to kill you...
>>>>> 
>>>>> Michael Will 
>>>> 
>>>> 
>>>> 
>>>> uh.. just a quick comment on bioinformatics and parallelizing things...
>>>> 
>>>> please note, that most of the bioinformatic problems are already
>>>> embarrassingly parallel and, with the new genomes showing up at an 
>>>> amazing rate, getting more and more so. Thus, in most cases, it just
>>>> doesn't make much sense to parallelize anything - if one's got to
>>>> run 300x4000 blasts against a library of 300x4000 sequences (ie 300
>>>> genomes, 4000 genes/proteins, all vs all) the simplest solution -
>>>> a lot of nodes, blast optimized for a single cpu and a decent queing
>>>> system will ultimately win (as long as one stays within the same
>>>> architecture; FPGAs are a diferent story ;o)
>>>> 
>>>> lukasz
>>>> 
>>> 
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>> 
>> 
>

-- 
Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu