[Beowulf] Beowulf Cluster VS Hadoop/Spark

Douglas Eadline deadline at eadline.org
Fri Dec 30 07:28:07 PST 2016


> Thanks, John, for your reply. Very interesting food for thought here. What
> I understand about Hadoop and Spark is that Spark is intended, I
> could be wrong here, as a replacement for Hadoop, as it performs better
> and faster than Hadoop.

Please allow me to let some air out of this idea that Spark is
a replacement for Hadoop. When I give a presentation on the subject,
I always start with:

   Hadoop V2 >= MapReduce

Which means Hadoop V2 is now a *platform* for analytics and not
the monolithic MapReduce engine of V1. As such, it supports MapReduce,
Spark, Giraph, MPI, and other parallel modes (efficiency is
another topic).

More people are using Spark instead of native MapReduce programming
(of course, it is easier). Tools like Hive SQL (built on top of MapReduce)
are just as fast as Spark in many cases (all the shiny
Spark comparisons use old batch-based Hive numbers).
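
To give a sense of the "it is easier" point, here is a minimal PySpark
word count sketch (the input/output paths and app name are made up);
the equivalent hand-written MapReduce job is several classes of Java
boilerplate:

    # Minimal PySpark word count -- a sketch, not a tuned job.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")

    counts = (sc.textFile("hdfs:///data/some_text")        # hypothetical input path
                .flatMap(lambda line: line.split())        # split lines into words
                .map(lambda word: (word, 1))               # emit (word, 1) pairs
                .reduceByKey(lambda a, b: a + b))          # sum counts per word

    counts.saveAsTextFile("hdfs:///data/wordcount_out")    # hypothetical output path
    sc.stop()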

Hadoop as a platform is usually based on HDFS (the Hadoop Distributed
File System; note, not a parallel file system). Other file systems can
be used (e.g., Lustre or Ceph). Hadoop uses a workflow
scheduler called YARN (Yet Another Resource Negotiator) that
has two features not found in HPC schedulers: run-time dynamic allocation
and data locality as a resource (when using HDFS).
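
Dynamic allocation, for instance, is something a Spark job can request
directly from YARN. A rough sketch (the executor counts are made-up
numbers; the config keys are standard Spark settings, but check your
cluster, and the external shuffle service must be running on the nodes):

    # Sketch: let YARN grow and shrink the executor pool at run time.
    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("dynamic-allocation-sketch")
            .setMaster("yarn")                                   # YARN schedules the executors
            .set("spark.dynamicAllocation.enabled", "true")      # run-time dynamic allocation
            .set("spark.shuffle.service.enabled", "true")        # required for dynamic allocation
            .set("spark.dynamicAllocation.minExecutors", "2")    # hypothetical lower bound
            .set("spark.dynamicAllocation.maxExecutors", "20"))  # hypothetical upper bound

    sc = SparkContext(conf=conf)
    # ... work goes here; YARN can add or remove executors as the load changes ...
    sc.stop()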

Most analytics projects use a collection of tools, one of which
can be Spark, but MapReduce tools like Pig and Hive are also very
important in things like data munging and preparation.
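
As a concrete (hypothetical) example of that mixed workflow: a Hive or
Pig job munges raw data into a table in the Hive metastore, and a Spark
job then reads that table for the analysis step. With Spark 2.x it looks
roughly like this (table and column names are invented):

    # Sketch: Spark picking up a table that Hive/Pig prepared upstream.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-plus-spark-sketch")
             .enableHiveSupport()      # talk to the cluster's Hive metastore
             .getOrCreate())

    events = spark.sql("SELECT user_id, event_type FROM cleaned_events")
    events.groupBy("event_type").count().show()

    spark.stop()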

Spark can be run as part of a Hadoop platform or (I assume) under the control
of an HPC-type scheduler. The issue becomes tricky when you want to
provide both a fully robust analytics platform (like Hortonworks, the
Red Hat of Hadoop) and an HPC platform (something I plan
to look into this year).

Long email for the end of the year, so Happy New Year everyone, and
remember:

 Hadoop V2 >= MapReduce



>
> Is Spark also Java-based?


Spark is written in Scala (a JVM language), but it can use a Python front end (PySpark).


--
Doug

>
> I never thought Java could be so performant.
> I know when I started learning to program in Java (Java 6) it was slow
> and clunky. Wouldn't it be better to stick with a pure Beowulf cluster
> and build your apps in C or C++, something that is closer to the machine
> language, than to use an interpreted language such as Java? I think
> where I fall short in understanding is how Hadoop and Spark have
> made Java so quick compared to a compiled language.
>
> On 2016-12-30 08:47, John Hanks wrote:
>
>> This often gets presented as an either/or proposition, and it's really
>> not. We happily use SLURM to schedule the setup, run, and teardown of
>> Spark clusters. At the end of the day it's all software, even the kernel
>> and OS. The big secret of HPC is that in a job scheduler we have an
>> amazingly powerful tool to manage resources. Once you are scheduling
>> Spark clusters, Hadoop clusters, VMs as jobs, containers, long-running
>> web services, ..., you begin to feel sorry for those poor "cloud"
>> people trapped in buzzword land.
>>
>> But, directly to your question: what we are learning as we dive deeper
>> into Spark (interest in Hadoop here seems to be minimal and fading) is
>> that it is just as hard, or maybe harder, to tune than MPI, and the
>> people who want to use it tend to have a far looser grasp of how to tune
>> it than those using MPI. In the short term I think it is beneficial as a
>> sysadmin to spend some time learning the inner squishy bits to
>> compensate for that. A simple word-count example or search can show that
>> wc and grep can often outperform Spark, and it takes some experience to
>> understand when a particular approach is the better one for a given
>> problem. (Where better is measured by efficiency, not by the number of
>> cool new technical toys that were employed. :)
>>
>> jbh
>>
>> On Fri, Dec 30, 2016, 9:32 AM Jonathan Aquilina
>> <jaquilina at eagleeyet.net> wrote:
>>
>>> Hi All,
>>>
>>> Seeing the new activity about new clusters for 2017 sparked a
>>> thought in my mind: Beowulf cluster vs. Hadoop/Spark.
>>>
>>> In this day and age, given that we have technology like Hadoop and
>>> Spark to crunch large data sets, why build a cluster of PCs instead of
>>> using something like Hadoop/Spark?
>>>
>>> Happy New Year
>>>
>>> Jonathan Aquilina
>>>
>>> Owner EagleEyeT
>>
>> --
>>
>> '[A] talent for following the ways of yesterday, is not sufficient to
>> improve the world of today.'
>> - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC


--
Doug
