[Beowulf] Beowulf Cluster VS Hadoop/Spark

Douglas Eadline deadline at eadline.org
Fri Dec 30 12:41:29 PST 2016


> I suspect that you can take any hadoop/spark application and give it to a
> good C/C++/OpenMP/MPI coder and in six months, a year, two years,..., you
> will end up with a much faster and much more efficient application.
> Meanwhile the original question the application was answering very likely
> won't matter to those who originally used hadoop/spark to answer it.

Some problems may merit programmer costs of that magnitude;
others can tolerate a little inefficiency because the
application works and scales now.

>
> It's worth keeping in mind that a lot (maybe most) of the "big data"
> analysis being done is done in Microsoft Excel.

Marketing aside, they call it big data for a reason.
Performing data analytics on a laptop is perfectly feasible,
and a good way to get a feel for your data. If you need to
turn your app loose on terabytes of real data generated each day,
then you may want to use something a bit heftier.

> Python and R cover a big
> chunk of it, often running on laptops.

PySpark runs on a laptop, and if your data grows to where you need
to scale, PySpark scales.
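
A minimal sketch of that point, not code from this thread: the same
word count written once with the Python standard library and once with
PySpark, where the only thing that changes between a laptop and a
cluster is the master URL. The function names, the sample data, and the
naive whitespace tokenizer are all assumptions made for illustration,
and the PySpark version assumes pyspark and a JVM are installed.

```python
from collections import Counter


def tokenize(line):
    """Split a line into lowercase words (a deliberately naive tokenizer)."""
    return line.lower().split()


def count_words_local(lines):
    """Single-process word count using only the standard library."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return dict(counts)


def count_words_spark(lines, master="local[*]"):
    """Same count with PySpark.

    `master` is the only knob that changes between a laptop ("local[*]")
    and a cluster (for example a spark:// URL). pyspark is imported
    lazily so the local version works without it.
    """
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master(master)
        .appName("wordcount-sketch")  # hypothetical application name
        .getOrCreate()
    )
    try:
        rdd = spark.sparkContext.parallelize(lines)
        pairs = rdd.flatMap(tokenize).map(lambda word: (word, 1))
        return dict(pairs.reduceByKey(lambda a, b: a + b).collect())
    finally:
        spark.stop()


if __name__ == "__main__":
    sample = ["big data on a laptop", "big data on a cluster"]
    print(count_words_local(sample))
```

For a data set that fits in memory the standard-library version is
likely to finish before a Spark session has started; the PySpark version
only pays off once the input no longer fits on one machine.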

--
Doug

> I assume anyone reading a post on
> this list, like me, suffers from "cluster bias" which causes us to forget
> that the bulk of computational work taking place in the world happens
> outside, far outside, of the top 100 machines. And much of that work is
> done by people who care more about the total time to solution and will
> happily trade a little additional CPU time for a better, easier and more
> powerful abstraction to use to ask questions.
>
> Consider also the increasing amount of this work being done by training a
> deep learning framework after which the researcher may or may not be able
> to explain how/why the thing works. Port that to C :)
>
> In general I always get suspicious at any suggestion of a pure approach to
> anything. As with "centralization" efforts in the world of IT, a pure
> approach is often code for "arbitrary boundaries for you that we are
> comfortable with." Hadoop/spark are great data exploration tools and
> someone who understands their data and knows Python can do wonderful
> things in a Python notebook backed by an appropriately sized spark
> cluster and then be off to the next question before "hello world" can
> be compiled in C.
> I for one welcome our new big data overlords, unless they demand to run
> Excel on the cluster.
>
> Good thing it's close to my bedtime, I have exhausted my daily buzzword
> quota.
>
> jbh
>
> On Fri, Dec 30, 2016, 11:00 AM Jonathan Aquilina <jaquilina at eagleeyet.net>
> wrote:
>
>> Thanks John for your reply. Very interesting food for thought here. What
>> I do understand between hadoop and spark is that spark is intended, I
>> could be wrong here, as a replacement for hadoop as it performs better
>> and faster than hadoop.
>>
>> Is spark also Java based? I never thought Java to be so performant. I
>> know when I started learning to program in Java (Java 6) it was slow and
>> clunky. Wouldn't it be better to stick with a pure Beowulf cluster and
>> build your apps in C or C++, something that is closer to the machine
>> language than an interpreted language such as Java? I think where I
>> fall short is in understanding how, with hadoop and spark, they have
>> made Java so quick compared to a compiled language.
>>
>>
>>
>> On 2016-12-30 08:47, John Hanks wrote:
>>
>> This often gets presented as an either/or proposition and it's really
>> not. We happily use SLURM to schedule the setup, run and teardown of
>> spark clusters. At the end of the day it's all software, even the kernel
>> and OS. The big secret of HPC is that in a job scheduler we have an
>> amazingly powerful tool to manage resources. Once you are scheduling
>> spark clusters, hadoop clusters, VMs as jobs, containers, long running
>> web services, ...., you begin to feel sorry for those poor "cloud"
>> people trapped in buzzword land.
>>
>> But, directly to your question, what we are learning as we dive deeper
>> into spark (interest in hadoop here seems to be minimal and fading) is
>> that it is just as hard or maybe harder to tune for than MPI, and the
>> people who want to use it tend to have a far looser grasp of how to tune
>> it than those using MPI. In the short term I think it is beneficial as a
>> sysadmin to spend some time learning the inner squishy bits to
>> compensate for that. A simple wordcount example or search can show that
>> wc and grep can often outperform spark, and it takes some experience to
>> understand when a particular approach is the better one for a given
>> problem. (Where better is measured by efficiency, not by the number of
>> cool new technical toys employed :)
>>
>> jbh
>>
>> On Fri, Dec 30, 2016, 9:32 AM Jonathan Aquilina
>> <jaquilina at eagleeyet.net>
>> wrote:
>>
>> Hi All,
>>
>> Seeing the new activity about new clusters for 2017, this sparked a
>> thought in my mind here. Beowulf Cluster vs hadoop/spark
>>
>> In this day and age, given that there is the technology with hadoop and
>> spark to crunch large data sets, why build a cluster of PCs instead of
>> using something like hadoop/spark?
>>
>>
>>
>> Happy New Year
>>
>> Jonathan Aquilina
>>
>> Owner EagleEyeT
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>> --
>> '[A] talent for following the ways of yesterday, is not sufficient to
>> improve the world of today.'
>>  - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
>>


-- 
Doug
