[Beowulf] Beowulf Cluster VS Hadoop/Spark

Douglas Eadline deadline at eadline.org
Fri Dec 30 12:24:28 PST 2016


> As I thought about that I decided it's worth expanding on stereotypical
> MPI
> user vs. stereotypical Spark user. In general if I ask each about the I/O
> pattern of their codes, I'll get:
>
> MPI user: "We open N files for read, M files for write, typically Y% of
> this is random r/w with some streaming at stage ....."

which is how it should be
>
> Spark user: "What does 'I/O mean? I didn't see that in the 'Spark for
> Budding Data Scientists' tutorial I just finished earlier today..."

which is how it should be
>
> The "data science" area has some maturing to do which should be exciting
> and fun for all of us :)

I am curious what you mean by "maturing." The problem space is quite
different and the goals are quite different, which necessitates
different designs. Are you aware that the Hadoop project once used
Torque and Maui as its scheduler? The project developed its own
scheduler (YARN) because it needed the ability to add and subtract
resources (containers) at run time, and a way to schedule with
data locality in mind.


--
Doug



>
> jbh
>
> On Fri, Dec 30, 2016, 10:47 AM John Hanks <griznog at gmail.com> wrote:
>
>> This often gets presented as an either/or proposition and it's really
>> not.
>> We happily use SLURM to schedule the setup, run and teardown of spark
>> clusters. At the end of the day it's all software, even the kernel and
>> OS.
>> The big secret of HPC is that in a job scheduler we have an amazingly
>> powerful tool to manage resources. Once you are scheduling spark
>> clusters,
>> hadoop clusters, VMs as jobs, containers, long running web services,
>> ....,
>> you begin to feel sorry for those poor "cloud" people trapped in
>> buzzword
>> land.
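For anyone who hasn't seen the pattern, "scheduling a Spark cluster as a
job" looks roughly like this. This is only a sketch: it assumes Spark 3.x
script names, a shared filesystem, and a hypothetical install path; the
node counts, times, and application name are made up, and real sites add
more plumbing around worker startup.

```shell
#!/bin/bash
# Sketch: stand up a throwaway Spark standalone cluster inside a SLURM
# allocation, run one application, then tear it all down.
#SBATCH --job-name=spark-cluster
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --time=02:00:00

export SPARK_HOME=/opt/spark          # hypothetical install path
MASTER_URL="spark://$(hostname):7077"

"$SPARK_HOME/sbin/start-master.sh"    # master on the batch node

# One worker per allocated node, all pointing at the master.
srun "$SPARK_HOME/sbin/start-worker.sh" "$MASTER_URL" &
sleep 10                              # crude wait for workers to register

# The actual work; the cluster lives only as long as the job.
"$SPARK_HOME/bin/spark-submit" --master "$MASTER_URL" my_analysis.py

"$SPARK_HOME/sbin/stop-worker.sh"
"$SPARK_HOME/sbin/stop-master.sh"
```

When the allocation ends, SLURM reclaims the nodes either way, which is
the point: the scheduler, not the framework, owns the resources.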
>>
>> But, directly to your question what we are learning as we dive deeper
>> into
>> spark (interest in hadoop here seems to be minimal and fading) is that
>> it
>> is just as hard or maybe harder to tune for than MPI and the people who
>> want to use it tend to have a far looser grasp of how to tune it than
>> those
>> using MPI. In the short term I think it is beneficial as a sysadmin to
>> spend some time learning the inner squishy bits to compensate for that.
>> A
>> simple wordcount example or search can show that wc and grep can often
>> outperform spark and it takes some experience to understand when a
>> particular approach is the better one for a given problem. (Where better
>> is
>> measured by efficiency, not by the number of cool new technical toys
>> that were employed :)
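To make that wordcount point concrete, here is the coreutils version as a
small sketch (the sample file and its contents are made up):

```shell
# Make a toy "dataset" (any text file will do).
printf 'the quick brown fox\nthe lazy dog\n' > sample.txt

# Wordcount, the canonical first Spark tutorial example, with no JVM,
# no cluster, and no startup latency:
wc -w < sample.txt          # total word count

# Search, likewise:
grep -c 'the' sample.txt    # number of lines matching the pattern
```

For data that fits on one node, the single-pass pipeline often wins
outright; Spark starts paying off when the data or the computation no
longer does.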
>>
>> jbh
>>
>> On Fri, Dec 30, 2016, 9:32 AM Jonathan Aquilina
>> <jaquilina at eagleeyet.net>
>> wrote:
>>
>> Hi All,
>>
>> Seeing the new activity about new clusters for 2017, this sparked a
>> thought in my mind here. Beowulf Cluster vs hadoop/spark
>>
>> In this day and age, given that we have technology like Hadoop and
>> Spark to crunch large data sets, why build a cluster of PCs instead of
>> using something like Hadoop/Spark?
>>
>>
>> Happy New Year
>>
>> Jonathan Aquilina
>>
>> Owner EagleEyeT
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>> --
>> ‘[A] talent for following the ways of yesterday, is not sufficient to
>> improve the world of today.’
>>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>


