[Beowulf] Hadoop's Uncomfortable Fit in HPC

Mon May 19 12:26:17 PDT 2014

> Great write-up by Glenn Lockwood about the state of Hadoop in HPC. It
> pretty much nails it, and offers a nice overview of the current
> ongoing efforts to make it relevant in that field.
>
> http://glennklockwood.blogspot.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html
>
> Most spot on thing I've read in a while. Thanks Glenn.

I concur. A good assessment of Hadoop. Most HPC users would
benefit from reading Glenn's post. I would offer
a few other thoughts (and a free pdf).

First, I have recently co-authored a book on Apache Hadoop
YARN (i.e. Hadoop V2). As a hardcore HPC dude, it was an
interesting exercise.

The most important thing to remember about Hadoop is that it
is changing and evolving and is not necessarily
synonymous with MapReduce. Hadoop V2 (with YARN) is more of a
general purpose "cluster OS" on which application frameworks
can be built. Yes, it is written in Java, and yes, HDFS seems
downright weird, and yes, it seems to have some different ways of
doing things, but there are valid reasons for the design.

To help understand Hadoop, I recommend reading the first chapter
of the book (available for free) as it provides history and rationale
for Hadoop development (I think this is the first time it has
been carefully written down). You can get a free pdf copy of this
chapter from:

  http://ptgmedia.pearsoncmg.com/images/9780321934505/samplepages/0321934504.pdf

The goal of Hadoop is analysis of all types and forms
of large unrelated data sets. The term "Hadoop data lake" is
becoming a new buzzword.

As you might imagine, a massive "data lake" is where all organizational
data is placed (dumped, copied, archived) for processing.  Some of the
processing might be batch-oriented MapReduce, or real-time MapReduce
with Apache Tez, or graph processing using Apache Giraph, or
in-memory processing using Apache Spark, or even MPI (though
not optimal), or any other framework you care to create (in any
language you desire, maybe with a little Java glue). And the frameworks
can use Hadoop services such as data locality or dynamic run-time
resource allocation/de-allocation.

Personally, I see parts of the Hadoop ecosystem creeping into
HPC. Intel's work with Lustre and Hadoop is an example
of Hadoop tools accessing the "scientific data lake". I think
this will happen as needed and where it makes sense based on the
data size and available tools.

I see a stronger migration of HPC methods into the
"non-HPC data lakes." Things like Hadoop V2 now make this
possible.

Notice that other than this sentence,
I did not use the term "Big Data" in this email.

--
Doug

> Cheers,
> --
> Kilian
