[Beowulf] Big Data and HPC

Mark Hahn hahn at mcmaster.ca
Wed Jun 11 21:27:16 PDT 2014

> As we're involved in HPC and life sciences there are people who deal
> with vast amounts of data, but with bioinformatics tools, not with
> Hadoop et al.

BD: it's not about the size, it's how you use it ;)

> So, do Python/Perl/Java/R count as Big Data tools in the survey, or
> not? :-)

not.  if BD is to have any useful meaning, I think it should only
refer to the kind of "retrospective" processing which gives us such
memorable examples.  Target detecting pregnant customers, for instance:
it's not that Target has a lot of sales records, or a lot of customers,
but rather that they extract something by looking back over the history,
combining multiple data sources, to gain some "emergent" insight.

I think of BD as a kind of analysis of "repurposed" data. 
or maybe "meta-analytics".

it's not that bioinformatics never does this kind of analysis,
just that most bioinfo data is collected for a purpose - maybe 
a sample's sequence to compare with a DB of known sequences.
on the other hand, imagine if you crunch on your lab's historic 
collection of sequences from machine A and discover that it has 
produced data that is statistically different from all the data 
collected from machine B in the same lab.  maybe fragments are 
consistently shorter, or somehow B just likes to produce sequences
beginning with ATATAGA.  (I'm totally making this up, of course.)
the latter kind of analysis seems much more BDish to me.
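
to make the machine A vs. machine B idea a bit more concrete, here is a
rough sketch in Python.  everything in it is an assumption for the sake
of illustration: the function names, the choice of a permutation test on
mean fragment length, and the ATATAGA prefix (which is as made up here as
it is above).  it only shows the shape of a "look back over what we
already collected" analysis, not how any real lab would do it.

```python
import random
from statistics import mean


def mean_length_gap(seqs_a, seqs_b):
    """Difference in mean fragment length between two sets of sequences."""
    return mean(len(s) for s in seqs_a) - mean(len(s) for s in seqs_b)


def permutation_p_value(seqs_a, seqs_b, trials=2000, seed=0):
    """Estimate how often a random relabelling of the pooled sequences
    gives a length gap at least as large as the observed one.
    (Hypothetical helper; the test and its parameters are assumptions.)"""
    rng = random.Random(seed)
    observed = abs(mean_length_gap(seqs_a, seqs_b))
    pooled = list(seqs_a) + list(seqs_b)
    n_a = len(seqs_a)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        if abs(mean_length_gap(pooled[:n_a], pooled[n_a:])) >= observed:
            hits += 1
    # add-one correction so the estimate is never exactly zero
    return (hits + 1) / (trials + 1)


def prefix_fraction(seqs, prefix="ATATAGA"):
    """Fraction of sequences that start with the given prefix."""
    return sum(s.startswith(prefix) for s in seqs) / len(seqs)
```

a small p-value from permutation_p_value would suggest the two machines'
historic output really does differ in fragment length, and comparing
prefix_fraction for A and B would flag a machine with an odd taste in
leading bases.  neither question was the reason the sequences were
collected, which is the point.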

I'm sure there are a gazillion BDish analyses that could be, and are
being, done on even the standard sequence banks (oh, look: whenever
we see this sequence for hairy toes, it correlates with slightly 
pointy ears, modest stature and a fondness for 6 meals a day...)

regards, mark.

