[Beowulf] Maker2 genomic software license experience?

Thu Nov 8 21:03:14 PST 2012

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/08/2012 06:10 AM, Tim Cutts wrote:
> 
> On 8 Nov 2012, at 13:52, Skylar Thompson
> <skylar.thompson at gmail.com> wrote:
> 
>> I guess if your development time is sufficiently shorter than
>> the equivalent compiled code, it could make sense.
> 
> This is true, and a lot of what these guys are writing is pipeline
> glue joining other bits of software together, for which scripting
> languages are perfect.  But there is an element of the "to the man
> with a hammer everything looks like a nail" thing going on, and
> people are writing analysis algorithms in these languages too.
> That's fine for prototyping, but once you run it in production and
> it's going to use thousands of CPU-years, it might be nice if
> occasionally the prototypes were replaced with something that could
> run in hundreds of CPU-years instead.  In those cases, investing a
> few extra weeks in implementing in a "harder" language is
> cost-effective.
> 
>> In Genome Sciences here at University of Washington, the grad
>> students are taught Python and R, and there are a number of people
>> who love the Python MPI bindings. We also have some C MPI users,
>> but it's not as popular as Python.
>> 
>> I suppose what you can say is, for the right application, Python
>> MPI certainly is faster than serial Python.
> 
> Maybe, maybe not.  If the problem is embarrassingly parallel, which
> many genomics problems are, often not.  We never adopted MPI-BLAST
> at Sanger, taking an old example, because the throughput was always
> far greater running multiple independent serial BLAST jobs, at
> least in a mixed environment where the BLAST searches weren't
> terribly predictable.
> 
> Plus of course, writing that MPI version of the code is much harder
> to get right than the serial version, so it goes against the
> original argument for keeping the development time short.
> 
> I realise I'm playing devil's advocate here, to a great extent.
> But most genomics that I've dealt with so far is really about high
> throughput, not about short turnaround time of a single analysis
> job.  Of course there are some exceptions, and I'm making far too
> many sweeping generalisations here.
> 
> Tim
> 

This is definitely true. Many of the MPI jobs here are not what many
Beowulfers think of as traditional parallel jobs: they aren't tightly
coupled. Instead, one master rank farms data-parallel jobs out to the
child ranks and then does some post-processing when everything is
finished. They could easily be written as a gang of serial jobs and
get the same speedup (or lack of speedup; a perennial challenge is
explaining how slow disks really are).
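
To make the pattern concrete, here is a rough sketch of that
master/worker shape. It is not code from any of our users: the
per-sequence task (count_gc), the function names, and the use of a
standard-library thread pool in place of MPI ranks are all assumptions
for illustration. The point is only that the master hands out
independent tasks and post-processes once, so a plain loop (or a batch
of serial jobs) gives the same answer.

```python
from concurrent.futures import ThreadPoolExecutor


def count_gc(seq):
    """Worker task (hypothetical): count G and C bases in one sequence."""
    return sum(1 for base in seq.upper() if base in "GC")


def gc_fraction(counts, sequences):
    """Post-processing step run once, after every task has finished."""
    total = sum(len(s) for s in sequences)
    return sum(counts) / total if total else 0.0


def farm_out(sequences, workers=4):
    """Master: hand each sequence to a worker, then post-process.

    The thread pool stands in for the child ranks; no task talks to
    any other task, which is what makes the job loosely coupled.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = list(pool.map(count_gc, sequences))  # input order kept
    return gc_fraction(counts, sequences)


def run_serially(sequences):
    """Same work as independent serial tasks, one after another."""
    counts = [count_gc(s) for s in sequences]
    return gc_fraction(counts, sequences)


if __name__ == "__main__":
    reads = ["GGCC", "ATAT", "gcat"]
    print(farm_out(reads), run_serially(reads))
```

Because the tasks never communicate, nothing here needs MPI; a
scheduler running one serial job per input would do the same work, and
if the inputs come off slow disks neither version will go any faster.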

Skylar
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://www.enigmail.net/

iEYEARECAAYFAlCcjpIACgkQsc4yyULgN4Z0xwCgr6zrkXUAmUDrJjuwbB2y2F44
VPEAn2QzzhaLGCOFObLx9r6QHprmCekE
=m1w0
-----END PGP SIGNATURE-----