[Beowulf] Clustering vs Hadoop/spark

Jonathan Aquilina jaquilina at eagleeyet.net
Tue Nov 24 16:45:05 UTC 2020

Hi Doug,

Appreciate the clarification where I am not clear is given Hadoop and derivatives are java based where all of this performance all of a sudden comes from. Is it due to where the data resides?

At one of my previous jobs I worked with Hadoop through Amazon AWS EMR managed to churn through 5 years' worth of historical data in 1 week. Data being calculations on vehicular tracking data.

When I learned java as part of my degree I used to see it as clunky why go for an interpreted language such as java over something more low level like c/c++ on a traditional cluster?


-----Original Message-----
From: Douglas Eadline <deadline at eadline.org> 
Sent: 24 November 2020 17:38
To: Jonathan Aquilina <jaquilina at eagleeyet.net>
Cc: beowulf at beowulf.org
Subject: Re: [Beowulf] Clustering vs Hadoop/spark

> Hi Guys,
> I am just wondering what advantages does setting up of a cluster have 
> in relation to big data analytics vs using something like Hadoop/spark?

Long email and the details are important.

It all comes down to filesystems and schedulers. But first remember, most Data Analytics projects use many different tools and have various stages that often require iteration and development (e.g. ETL->Feature Matrix->and running models, repeat, and 80% of the work in in first two steps) And, many end-users do not use Java map-reduce APIs. They use higher level tools.


1) Traditional Hadoop filesystem (HDFS) is about slicing large data files (or large number of files) across multiple servers, then doing the map phase on all servers at the same time (moving computation to where the data "live", reduce phase requires a shuffle (data movement) and final reduction of data.

2) On-prem HDFS still makes some sense (longer story) however, in the Cloud there is move to using native cloud storage using Apache Ozone FS.  You loose the "data locality," but gain all the cloud Kubernettes stuff.

3) Both Hadoop Map-Reduce (mostly Hive RDB applications now) and Spark do "in-memory" map-reduce for performance reasons.
In this case, data locality for processing is not as important, However, loading and storing files on large multi-server memory resident jobs still gains from HDFS. Very often Spark writes/reads results into Hive tables.


1) Map Reduce scheduling is different than traditional HPC scheduling.
The primary Hadoop scheduler is called YARN (Yet Another Resource
Negotiator) It has two main features not found in most HPC schedulers, data locality as a resource and dynamic resource allocation.

2) Data locality is about moving jobs to where the data (slice) lives on the storage nodes (hyper-converged storage/compute nodes)

3) Dynamic resource allocation developed because most map-reduce jobs need a lot of containers for map phase, but much-much less for reduce phase, so Hadoop map-reduce can give back resources and ask for more later in other stages of the DAG (multiple map reduce phases are run as a Directed Acyclic Graph)

Thus, this model is hard to map on to a traditional HPC cluster.
There are map-reduce libraries for MPI. Another way to think about it is Data Analytics is almost always SIMD, all tools language and platforms are optimized to take advantage of map-reduce SIMD operations and data flow.


> Regards,
> Jonathan
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin 
> Computing To change your subscription (digest mode or unsubscribe) 
> visit https://beowulf.org/cgi-bin/mailman/listinfo/beowulf


More information about the Beowulf mailing list