[Beowulf] are there any known attempts to apply hadoop BigData techniques to weather modelling?

Ellis H. Wilson III ellis at cse.psu.edu
Tue Feb 17 14:16:53 PST 2015

On 02/17/2015 04:56 PM, Prentice Bisbal wrote:
> Why do you think 'Big Data' techniques would be applicable to this?
> A large amount of data != big data.

Heh.  Let's not pretend like 'big data' means anything of substance now :D.

> 'Big Data' techniques are typically for finding trends in unstructured
> data from multiple sources, whereas the output of scientific simulations
> is usually from a single source in some sort of structured format. I
> just don't see any applicability here whatsoever.

I would argue this is a bit overly specific.  This might be the
typical use case, but there is no reason why Hadoop and MapReduce
couldn't be used to do simple filtering of scientific simulation
output.  If you were looking for places in a huge output file where
temperature falls within some range and elevation has a specific
value, I could certainly see value in applying an easily programmable
scaling framework to basically "smart grep" through your data.
Hadoop/MR could certainly help you do that.
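
To make the "smart grep" idea concrete, here is a minimal sketch of the
map and reduce steps in plain Python.  It does not use Hadoop itself;
the record layout (cell id, temperature in kelvin, elevation in metres),
the sample values, and the names mapper, reducer and smart_grep are all
assumptions made up for illustration.  In a real job the mapper would
run in parallel over splits of the output file.

```python
from functools import reduce

# Hypothetical record layout: (cell_id, temperature_K, elevation_m).
# These values are invented sample data, not real simulation output.
RECORDS = [
    (0, 271.5, 100),
    (1, 284.0, 250),
    (2, 290.2, 250),
    (3, 301.7, 250),
    (4, 288.9, 400),
]


def mapper(record, t_min, t_max, elevation):
    """Emit a ("match", cell_id) pair if the record passes the filter."""
    cell_id, temp, elev = record
    if t_min <= temp <= t_max and elev == elevation:
        yield ("match", cell_id)


def reducer(acc, pair):
    """Group emitted values by key, as the reduce phase would."""
    key, value = pair
    acc.setdefault(key, []).append(value)
    return acc


def smart_grep(records, t_min, t_max, elevation):
    """Run the map step over every record, then fold the results."""
    pairs = (p for r in records for p in mapper(r, t_min, t_max, elevation))
    return reduce(reducer, pairs, {})


if __name__ == "__main__":
    # Cells between 280 K and 295 K at exactly 250 m elevation.
    print(smart_grep(RECORDS, 280.0, 295.0, 250))
```

The point is only that each record can be tested independently, which
is the kind of work that maps cleanly onto MR.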

Many output formats for scientific data, such as HDF5, are
well-structured, as you mentioned.  That doesn't mean you have a good
file system or a good parallel programming paradigm for doing
stupid-simple things with the data afterwards.  You just have a good
container format.  Hadoop
could provide the other bits you need.  A paper from the HDF5 group 
actually does a decent job of pointing out these kinds of differences, 
how you might get HDF5 containers in and out of HDFS and what impacts 


As they note in the paper, a recent work called SciHadoop (I was lucky
enough to talk in the same slot as the author at SC a year back) works
directly with NetCDF-formatted files, so that could be another option.
I don't know whether the source for SciHadoop is available, but a
quick google would likely give you that answer.

If you are asking, "should I do weather simulation using Hadoop or some 
other big data framework," my answer is a resounding NO.  There are VERY 
different (often far more limited) semantics and guarantees in MR than 
in other parallel programming paradigms, and you will almost certainly get 
burned if you try to shove a climate-shaped peg through the square hole 
that is MR.  This is probably what Prentice was getting at.



