[Beowulf] Clustering vs Hadoop/spark [EXT]

Tim Cutts tjrc at sanger.ac.uk
Wed Nov 25 09:26:37 UTC 2020



On 24 Nov 2020, at 18:31, Alex Chekholko via Beowulf <beowulf at beowulf.org> wrote:

If you can run your task on just one computer, you should always do that rather than having to build a cluster of some kind and all the associated headaches.


If you accept the cloud message, that of course isn’t necessarily the case.  If you use very high-level cloud services like AWS Lambda, you don’t have to build that infrastructure.  It’s very unlikely to be anywhere near as efficient, of course, but throughput efficiency is not what your average scientist cares about.  What they care about is getting their answer quickly (and, to a lesser extent, cheaply).

I saw a recent example where someone took a fairly simple sequencing read alignment process, which normally runs on a single 16-core node in about 6 hours, and split the input files into pieces small enough that the execution time and memory use of the alignment code would fit within AWS Lambda’s envelope.  The result executed in a couple of minutes, elapsed, but used about four times as many core-hours as the optimised single-node version.  Of course, this is an embarrassingly parallel problem, so it is a relatively easy analysis to move to this sort of design.
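
To make the splitting step concrete, here is a minimal sketch of cutting a read file into independent chunks. It is not the code from the example I saw. It assumes the input is FASTQ with exactly four lines per read; the function name split_fastq and the toy reads are hypothetical, and a real chunk size would have to be tuned to the per-invocation time and memory limits.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Assumption: each FASTQ record is exactly four lines
# (header, sequence, '+', qualities).
LINES_PER_READ = 4


def split_fastq(lines: Iterable[str], reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield successive chunks holding at most reads_per_chunk whole reads."""
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    it = iter(lines)
    while True:
        chunk = list(islice(it, reads_per_chunk * LINES_PER_READ))
        if not chunk:
            return
        if len(chunk) % LINES_PER_READ != 0:
            raise ValueError("input ends in the middle of a FASTQ record")
        yield chunk


# Hypothetical toy input: five short reads.
toy = []
for i in range(5):
    toy += [f"@read{i}", "ACGT", "+", "IIII"]

chunks = list(split_fastq(toy, reads_per_chunk=2))
chunk_sizes = [len(c) // LINES_PER_READ for c in chunks]
```

Each chunk can then be aligned by a separate worker with no communication between workers, which is what makes the problem embarrassingly parallel.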

From the scientist’s point of view, which is better?  Getting their answer in 5 minutes or 6 hours?  Especially if they’ve also reduced their development time because they don’t have to worry so much about infrastructure and optimisation.
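
The trade-off can be put into rough numbers. This is a back-of-envelope sketch using only the approximate figures from the example (16 cores, about 6 hours, about four times the core-hours). The elapsed time of the split run is a parameter, since I only have "a couple of minutes" to go on, and the implied concurrency treats one core-hour as the same unit on both platforms, which is an assumption.

```python
SINGLE_NODE_CORES = 16
SINGLE_NODE_HOURS = 6.0
CORE_HOUR_FACTOR = 4.0  # "about four times as many core-hours"


def tradeoff(elapsed_minutes: float) -> dict:
    """Compare the single-node run with a split run of the given elapsed time."""
    single_core_hours = SINGLE_NODE_CORES * SINGLE_NODE_HOURS
    split_core_hours = single_core_hours * CORE_HOUR_FACTOR
    return {
        "single_core_hours": single_core_hours,
        "split_core_hours": split_core_hours,
        # How many times sooner the answer arrives.
        "speedup": SINGLE_NODE_HOURS * 60 / elapsed_minutes,
        # Average number of cores busy at once during the split run.
        "avg_concurrent_cores": split_core_hours * 60 / elapsed_minutes,
    }


two_minutes = tradeoff(2.0)
five_minutes = tradeoff(5.0)
```

Taking the elapsed time as 2 minutes gives 96 core-hours against 384, an answer 180 times sooner, and an average of about 11,520 cores in use at once. At 5 minutes the speedup is 72 times. It says nothing about what either run costs in money.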

The total value is hard to work out, because many of these considerations are hard to put a dollar value on.  When I saw that article, I did ask the author how much the analysis actually cost, and she didn’t have a number.  But I don’t think we can dogmatically say that we should always run a task on a single machine if we can.

Tim



-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.