[Beowulf] Beowulf Cluster VS Hadoop/Spark

John Hanks griznog at gmail.com
Fri Dec 30 00:03:06 PST 2016


As I thought about that, I decided it's worth expanding on the stereotypical
MPI user vs. the stereotypical Spark user. In general, if I ask each about the
I/O pattern of their codes, I'll get:

MPI user: "We open N files for read, M files for write, typically Y% of
this is random r/w with some streaming at stage ....."

Spark user: "What does 'I/O' mean? I didn't see that in the 'Spark for
Budding Data Scientists' tutorial I just finished earlier today..."
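
To make that contrast concrete, here is a minimal sketch (the file name and
slice size are made up) of the kind of I/O the MPI user is describing, using
mpi4py; the Spark user's equivalent read is the single commented line at the
end, where all of those decisions are left to the framework:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

chunk = 1 << 20                    # 1 MiB slice per rank, arbitrary
buf = np.empty(chunk, dtype=np.uint8)

# Collectively open the shared file, then have each rank read its own slice.
fh = MPI.File.Open(comm, "input.dat", MPI.MODE_RDONLY)
fh.Read_at(rank * chunk, buf)      # explicit offset: the user owns the pattern
fh.Close()

# The Spark version of "read the data": splits, placement and access pattern
# are all decided by the framework.
#   rdd = sc.textFile("input.dat")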

The "data science" area has some maturing to do which should be exciting
and fun for all of us :)

jbh

On Fri, Dec 30, 2016, 10:47 AM John Hanks <griznog at gmail.com> wrote:

> This often gets presented as an either/or proposition and it's really not.
> We happily use SLURM to schedule the setup, run, and teardown of Spark
> clusters. At the end of the day it's all software, even the kernel and OS.
> The big secret of HPC is that in a job scheduler we have an amazingly
> powerful tool to manage resources. Once you are scheduling Spark clusters,
> Hadoop clusters, VMs as jobs, containers, long-running web services, ...,
> you begin to feel sorry for those poor "cloud" people trapped in buzzword
> land. A sketch of that pattern follows below.
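>
> Roughly, and hedged heavily: this assumes Spark is installed at $SPARK_HOME
> on every node, that srun/scontrol behave as usual at your site, and the job
> script name is made up. Run from inside an sbatch allocation, it stands up a
> throwaway standalone Spark cluster that dies with the SLURM job:
>
> import os
> import subprocess
> import time
>
> spark_home = os.environ["SPARK_HOME"]
> nodes = subprocess.run(
>     ["scontrol", "show", "hostnames", os.environ["SLURM_JOB_NODELIST"]],
>     capture_output=True, text=True, check=True).stdout.split()
> master, workers = nodes[0], nodes[1:]
> master_url = "spark://%s:7077" % master
>
> # One master on the first node of the allocation...
> procs = [subprocess.Popen(
>     ["srun", "--nodes=1", "--ntasks=1", "--nodelist", master,
>      spark_home + "/bin/spark-class",
>      "org.apache.spark.deploy.master.Master"])]
> time.sleep(10)  # crude: give the master a moment to come up
>
> # ...and one worker on each of the remaining nodes.
> for w in workers:
>     procs.append(subprocess.Popen(
>         ["srun", "--nodes=1", "--ntasks=1", "--nodelist", w,
>          spark_home + "/bin/spark-class",
>          "org.apache.spark.deploy.worker.Worker", master_url]))
>
> # The analysis itself is then an ordinary spark-submit against that master;
> # everything is torn down when the SLURM job ends.
> subprocess.run([spark_home + "/bin/spark-submit",
>                 "--master", master_url, "my_job.py"], check=True)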
>
> But, directly to your question: what we are learning as we dive deeper into
> Spark (interest in Hadoop here seems to be minimal and fading) is that it is
> just as hard to tune as MPI, or maybe harder, and the people who want to use
> it tend to have a far looser grasp of how to tune it than those using MPI.
> In the short term I think it is beneficial as a sysadmin to spend some time
> learning the inner squishy bits to compensate for that. A simple word count
> or search example can show that wc and grep can often outperform Spark, and
> it takes some experience to understand when a particular approach is the
> better one for a given problem. (Where "better" is measured by efficiency,
> not by the number of cool new technical toys employed. :)
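>
> For reference, the comparison I mean is something like this sketch in
> PySpark (corpus.txt, the output path, and the grep pattern are stand-ins):
>
> from pyspark import SparkContext
>
> sc = SparkContext(appName="wordcount-demo")   # assumes a reachable master
> counts = (sc.textFile("corpus.txt")
>             .flatMap(lambda line: line.split())
>             .map(lambda w: (w, 1))
>             .reduceByKey(lambda a, b: a + b))
> counts.saveAsTextFile("counts-out")
>
> # The Unix equivalents, which for anything that fits comfortably on one
> # node are usually faster end to end:
> #   tr -s '[:space:]' '\n' < corpus.txt | sort | uniq -c
> #   grep -c 'pattern' corpus.txt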
>
> jbh
>
> On Fri, Dec 30, 2016, 9:32 AM Jonathan Aquilina <jaquilina at eagleeyet.net>
> wrote:
>
> Hi All,
>
> Seeing the new activity about new clusters for 2017 sparked a thought:
> Beowulf cluster vs. Hadoop/Spark.
>
> In this day and age, given that we have the technology in Hadoop and Spark
> to crunch large data sets, why build a cluster of PCs instead of using
> something like Hadoop/Spark?
>
>
> Happy New Year
>
> Jonathan Aquilina
>
> Owner EagleEyeT
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
> --
> ‘[A] talent for following the ways of yesterday, is not sufficient to
> improve the world of today.’
>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>
-- 
‘[A] talent for following the ways of yesterday, is not sufficient to
improve the world of today.’
 - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC