deadline at eadline.org
Sat Feb 7 08:33:52 PST 2015
I understand your confusion. Hadoop and Big Data reached
"overused but not well understood" status years ago.
First, Hadoop started out as a MapReduce engine. This all
changed with Hadoop V2 and YARN (Yet Another Resource Negotiator).
Hadoop V2 can be considered a platform for applications that need
parallel access to large amounts of unstructured data (i.e. raw data not
in a traditional database). It can also be used with its own database, HBase,
which is based on Google's Bigtable.
The idea is this: a "Hadoop" cluster has a large amount of storage
using HDFS (or possibly another parallel filesystem). This is often referred
to as the "Data Lake." Raw data is dumped in the lake. There is no
ETL (Extract, Transform, and Load) step. Various Hadoop YARN frameworks use
this data. YARN provides a very dynamic resource allocation model and the
ability to provide data locality to your application (i.e. the traditional
MapReduce idea of "move the computation to the data").
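To make "move the computation to the data" a bit more concrete, here is a
toy word count in plain Python. It is only a sketch of the MapReduce pattern,
not Hadoop code: the node names, the "blocks" dictionary, and all function
names are made up for illustration, and everything runs in one process.

```python
from collections import defaultdict

# Hypothetical stand-ins for HDFS blocks, each stored on a different node.
blocks = {
    "node1": ["raw data is dumped in the lake"],
    "node2": ["the lake holds raw data"],
}

def map_words(lines):
    """Map phase: meant to run where the block lives; emits (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(grouped):
    """Reduce phase: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

def word_count(blocks):
    pairs = []
    for node, lines in blocks.items():
        # In a real cluster only these small (word, 1) pairs would leave
        # the node; the raw lines stay where they are stored.
        pairs.extend(map_words(lines))
    return reduce_counts(shuffle(pairs))

counts = word_count(blocks)
print(counts["lake"], counts["raw"], counts["is"])
```

The point of the pattern is that the map step touches the raw data in place
and only the much smaller intermediate pairs get moved around.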
Thus in a Hadoop V2 cluster you can have MapReduce applications (which
support many of the popular apps like Pig and Hive). It also supports
Spark, Storm, Giraph, and even MPI (not the most efficient, but it works).
There are many other applications being ported to YARN.
Second, Big Data is usually defined by Volume, Velocity, and Variety.
The definition seems to be whatever a vendor wants it to be, however.
It reminds me of products that suddenly became "grid ready" in years past.
Again, such designations mean as much as "now works with binary data."
Finally, if you are interested in Hadoop YARN you can check out the book
"Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with
Apache Hadoop 2" (I helped write it). There are also many online resources.
The first chapter of the book has the history of Hadoop as written by
one of the developers. It is quite interesting to read and helps dispel
many of the Hadoop myths. You can read this chapter for free here:
That is enough Hadoop for Saturday morning. Oh, and Hadoop clusters
are not going to supplant your HPC cluster.
> Can someone explain to me what exactly the purpose of hadoop is and what
> we mean when we say big data? Is this for data storage and retrieval?
> Number crunching?
> Jonathan Aquilina
> Founder Eagle Eye T