[Beowulf] Clustering vs Hadoop/spark

Alex Chekholko alex at calicolabs.com
Tue Nov 24 18:31:07 UTC 2020


Hi Jonathan,

You need a "cluster" if your task is too large to run on a single system,
or maybe if it takes too long to run on a single system.

So the primary motivation is practical, not theoretical.

If you can run your task on just one computer, you should always do that
rather than build a cluster of some kind, with all the associated
headaches.

One aspect of your question seems to be about performance, which is
ultimately limited by the hardware resources.  E.g. a "Hadoop cluster"
might be 10 servers, each with lots of CPUs and RAM and disks, etc.  For
"Hadoop" workloads the bottleneck is typically I/O, so the primary
parameter is the number of disks.  So the programming language is not
really an issue if you're spending all your time waiting on disk I/O.
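
As a rough illustrative calculation (round, assumed numbers): 10 servers
with 12 spinning disks each at ~150 MB/s gives about 18 GB/s of aggregate
streaming bandwidth, so scanning 10 TB takes on the order of 10 minutes no
matter how fast the per-record code is.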

Regards,
Alex

On Tue, Nov 24, 2020 at 10:22 AM Jonathan Aquilina via Beowulf <
beowulf at beowulf.org> wrote:

> Hi Doug,
>
> So what is the advantage then of a cluster?
>
> Regards,
> Jonathan
>
> -----Original Message-----
> From: Douglas Eadline <deadline at eadline.org>
> Sent: 24 November 2020 18:21
> To: Jonathan Aquilina <jaquilina at eagleeyet.net>
> Cc: beowulf at beowulf.org
> Subject: RE: [Beowulf] Clustering vs Hadoop/spark
>
>
> First I am not a Java expert (very far from it).
>
> Second, Java holds up quite well against Julia as compared to Python. (so
> does Lisp!)
>
>
> https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/julia.html
>
> Something else to consider is that the underlying Hadoop plumbing is
> written in Java (or Scala for Spark). However, it is possible with Hadoop
> to create mapper and reducer functions in any language (text-based
> stdin/stdout), similar to Spark, which can use Python, R, Java, or Scala
> as a front-end.
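>
> Purely as an illustration (not from the original mail), a minimal
> word-count mapper and reducer in Python that fit the Hadoop Streaming
> text-based stdin/stdout contract might look like this:
>
>   #!/usr/bin/env python3
>   # mapper.py -- read lines from stdin, emit "word<TAB>1" pairs on stdout
>   import sys
>
>   for line in sys.stdin:
>       for word in line.split():
>           print(f"{word}\t1")
>
>   #!/usr/bin/env python3
>   # reducer.py -- input arrives sorted by key; sum the counts per word
>   import sys
>
>   current, count = None, 0
>   for line in sys.stdin:
>       word, n = line.rstrip("\n").split("\t", 1)
>       if word != current:
>           if current is not None:
>               print(f"{current}\t{count}")
>           current, count = word, 0
>       count += int(n)
>   if current is not None:
>       print(f"{current}\t{count}")
>
> Hadoop Streaming only sees text on stdin/stdout, so the same pair of
> scripts could just as easily be written in Perl, R, or C.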
>
> So there is a bit of decoupling between the parallel compute mechanism
> and the compute code (i.e. in all these languages the user does not have
> to think about cores, interconnect, communications, etc.). It is a higher
> level abstraction.
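>
> As a sketch of that higher-level view (assuming PySpark and hypothetical
> HDFS paths), the same kind of job reads like ordinary collection code,
> with no mention of cores or interconnect:
>
>   from pyspark.sql import SparkSession
>
>   spark = SparkSession.builder.appName("wordcount").getOrCreate()
>   lines = spark.read.text("hdfs:///data/input")        # hypothetical path
>   counts = (lines.rdd
>             .flatMap(lambda row: row.value.split())    # "map" side
>             .map(lambda w: (w, 1))
>             .reduceByKey(lambda a, b: a + b))          # shuffle + reduce
>   counts.saveAsTextFile("hdfs:///data/output")         # hypothetical path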
>
> Much of the early Hadoop performance came from running large capability
> jobs on gobs of data, jobs that could not be run otherwise (except by
> Google), so any performance was good. Spark came along and said, let's
> put it all in a redundant distributed in-memory structure. The speed-up
> over traditional Hadoop was large, so Hadoop created the Tez API that
> does the same thing, and performance evened out.
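>
> A rough sketch of the in-memory idea (again PySpark, illustrative paths):
> once a dataset is cached, later passes reuse the distributed memory copy
> instead of re-reading from disk:
>
>   from pyspark.sql import SparkSession
>
>   spark = SparkSession.builder.appName("cache-demo").getOrCreate()
>   df = spark.read.parquet("hdfs:///data/events")   # hypothetical dataset
>   df.cache()                              # keep it in cluster memory
>   df.count()                              # first pass fills the cache
>   df.filter("status = 'error'").count()   # later passes skip the disk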
>
> Plus, analytics jobs are mostly integer. Floating point often comes
> into play when running the models, which is often not a big data problem
> (i.e. it doesn't need a big cluster to run).
>
> --
> Doug
>
> > Hi Doug,
> >
> > I appreciate the clarification. Where I am not clear is, given that
> > Hadoop and its derivatives are Java based, where all of this
> > performance suddenly comes from. Is it due to where the data resides?
> >
> > At one of my previous jobs I worked with Hadoop through Amazon AWS EMR
> > and managed to churn through 5 years' worth of historical data in 1
> > week; the data was calculations on vehicular tracking data.
> >
> > When I learned Java as part of my degree I used to see it as clunky, so
> > why go for an interpreted language such as Java over something more
> > low-level like C/C++ on a traditional cluster?
> >
> > Regards,
> > Jonathan
> >
> > -----Original Message-----
> > From: Douglas Eadline <deadline at eadline.org>
> > Sent: 24 November 2020 17:38
> > To: Jonathan Aquilina <jaquilina at eagleeyet.net>
> > Cc: beowulf at beowulf.org
> > Subject: Re: [Beowulf] Clustering vs Hadoop/spark
> >
> >
> >> Hi Guys,
> >>
> >> I am just wondering what advantages setting up a cluster has in
> >> relation to big data analytics vs using something like Hadoop/Spark?
> >>
> >
> > Long email and the details are important.
> >
> > It all comes down to filesystems and schedulers. But first remember,
> > most Data Analytics projects use many different tools and have various
> > stages that often require iteration and development (e.g. ETL -> Feature
> > Matrix -> running models, repeat; 80% of the work is in the first two
> > steps). And many end-users do not use Java map-reduce APIs. They use
> > higher level tools.
> >
> > Filesystems:
> >
> > 1) The traditional Hadoop filesystem (HDFS) is about slicing large data
> > files (or a large number of files) across multiple servers, then doing
> > the map phase on all servers at the same time (moving computation to
> > where the data "live"); the reduce phase requires a shuffle (data
> > movement) and a final reduction of the data.
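> >
> > As a toy single-machine illustration of that data flow (this is not
> > HDFS, just the map -> shuffle -> reduce shape in plain Python):
> >
> >   from collections import defaultdict
> >
> >   blocks = [["big data on node one"], ["more data on node two"]]  # stand-ins for HDFS slices
> >   # Map phase: runs independently on each slice where it lives
> >   mapped = [(w, 1) for blk in blocks for line in blk for w in line.split()]
> >   # Shuffle: group intermediate pairs by key (the data movement step)
> >   groups = defaultdict(list)
> >   for word, n in mapped:
> >       groups[word].append(n)
> >   # Reduce phase: final aggregation per key
> >   counts = {word: sum(ns) for word, ns in groups.items()}
> >   print(counts)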
> >
> > 2) On-prem HDFS still makes some sense (longer story); however, in the
> > cloud there is a move to using native cloud storage via Apache Ozone FS.
> > You lose the "data locality," but gain all the cloud Kubernetes stuff.
> >
> > 3) Both Hadoop Map-Reduce (mostly Hive RDB applications now) and Spark
> > do "in-memory" map-reduce for performance reasons. In this case, data
> > locality for processing is not as important; however, loading and
> > storing files for large multi-server memory-resident jobs still gains
> > from HDFS. Very often Spark writes/reads results into Hive tables.
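> >
> > A minimal sketch of that last pattern, assuming a Hive-enabled Spark
> > session and made-up table names:
> >
> >   from pyspark.sql import SparkSession
> >
> >   spark = (SparkSession.builder
> >            .appName("hive-io")
> >            .enableHiveSupport()      # lets Spark read/write Hive tables
> >            .getOrCreate())
> >   events = spark.table("warehouse.events")             # read a Hive table
> >   daily = events.groupBy("day").count()
> >   daily.write.mode("overwrite").saveAsTable("warehouse.daily_counts")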
> >
> > Schedulers:
> >
> > 1) Map-reduce scheduling is different from traditional HPC scheduling.
> > The primary Hadoop scheduler is called YARN (Yet Another Resource
> > Negotiator). It has two main features not found in most HPC schedulers:
> > data locality as a resource and dynamic resource allocation.
> >
> > 2) Data locality is about moving jobs to where the data (slice) lives
> > on the storage nodes (hyper-converged storage/compute nodes).
> >
> > 3) Dynamic resource allocation developed because most map-reduce jobs
> > need a lot of containers for the map phase, but far fewer for the
> > reduce phase, so Hadoop map-reduce can give back resources and ask for
> > more later in other stages of the DAG (multiple map-reduce phases are
> > run as a Directed Acyclic Graph).
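> >
> > One everyday place this shows up is Spark running on YARN with dynamic
> > allocation turned on; a sketch (the property names are Spark's, the
> > values are only illustrative):
> >
> >   from pyspark.sql import SparkSession
> >
> >   spark = (SparkSession.builder
> >            .appName("dynamic-allocation-demo")
> >            .config("spark.dynamicAllocation.enabled", "true")
> >            .config("spark.dynamicAllocation.minExecutors", "2")
> >            .config("spark.dynamicAllocation.maxExecutors", "50")
> >            # external shuffle service so executors can be released safely
> >            .config("spark.shuffle.service.enabled", "true")
> >            .getOrCreate())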
> >
> > Thus, this model is hard to map onto a traditional HPC cluster.
> > There are map-reduce libraries for MPI. Another way to think about it
> > is that Data Analytics is almost always SIMD; all the tools, languages,
> > and platforms are optimized to take advantage of map-reduce SIMD
> > operations and data flow.
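> >
> > To make the MPI comparison concrete, a minimal scatter/map/reduce
> > sketch with mpi4py (assumed to be installed; not part of the original
> > mail):
> >
> >   from mpi4py import MPI
> >
> >   comm = MPI.COMM_WORLD
> >   rank, size = comm.Get_rank(), comm.Get_size()
> >   chunks = None
> >   if rank == 0:
> >       data = list(range(1_000_000))
> >       chunks = [data[i::size] for i in range(size)]   # slice the data
> >   chunk = comm.scatter(chunks, root=0)                # distribute slices
> >   local = sum(x * x for x in chunk)                   # "map" on the local slice
> >   total = comm.reduce(local, op=MPI.SUM, root=0)      # global "reduce"
> >   if rank == 0:
> >       print(total)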
> >
> >
> > --
> > Doug
> >
> >
> >
> >>
> >> Regards,
> >> Jonathan
> >> _______________________________________________
> >> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
> >> Computing To change your subscription (digest mode or unsubscribe)
> >> visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
> >>
> >
> >
> > --
> > Doug
> >
>
>
> --
> Doug
>
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>

