[Beowulf] Slide on big data

Tim Cutts tjrc at sanger.ac.uk
Wed Feb 19 08:55:50 PST 2014

I think a better way to describe big data is not by what the data is, how large it is, or how it’s structured, but by how you ask questions of it and the rationale for which it was gathered.

In the traditional scientific model, you formulate a hypothesis and design an experiment to test it.  Your experiment generates some data, which could be tiny or could be ridiculously huge.  Either way, the data was gathered specifically to answer that question, and is probably not terribly useful for anything else.

Big data analysis, by contrast, does something different: you gather a large amount of data without any particular hypothesis in mind, or you pick some dataset that was gathered for some other purpose, and you simply look for ways the data can be clustered or organised, and then try to determine whether that tells you something interesting.

Using my own field as an example, genomics is definitely moving in that direction.  When I started looking for genetic variation associated with disease 20 years ago, sequencing was very expensive, so the typical hypothesis-driven model was used: we had a bunch of candidate genes which were plausibly involved in cancer, diabetes, or whatever our condition of interest was.  We looked for variations specifically in those genes, and determined whether they were associated with the condition.

Now, we use a much more “big data” approach.  We perform whole genome sequencing of thousands of individuals, without any hypothesis as to what might or might not be involved, and we let statistical analysis show us where the associations are.  What’s more, once a genome’s been sequenced for one project, it’s equally useful for any other association study that might be of interest (ethical and consent issues notwithstanding).
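
To make the contrast concrete, here is a minimal sketch in Python.  Everything in it is an assumption for illustration: the data is synthetic, the names (simulate, association_p_value, the variant index 123) are hypothetical, and the simple 2x2 chi-square test with a Bonferroni threshold stands in for whatever statistics a real study would use.  It is not a description of any actual pipeline.

```python
import math
import random


def chi2_p_value(table):
    """P-value of a 2x2 chi-square test (1 degree of freedom, no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # a row or column is empty: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, the upper tail of chi-square is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def association_p_value(carriers, is_case):
    """Test whether carrying a variant is associated with being a case."""
    pairs = list(zip(carriers, is_case))
    a = sum(1 for g, s in pairs if g and s)          # carrier, case
    b = sum(1 for g, s in pairs if g and not s)      # carrier, control
    c = sum(1 for g, s in pairs if not g and s)      # non-carrier, case
    d = sum(1 for g, s in pairs if not g and not s)  # non-carrier, control
    return chi2_p_value(((a, b), (c, d)))


def simulate(n_people=2000, n_variants=500, causal=123, seed=0):
    """Synthetic cohort: half cases, half controls; one variant is enriched in cases."""
    rng = random.Random(seed)
    is_case = [i < n_people // 2 for i in range(n_people)]
    variants = []
    for v in range(n_variants):
        column = []
        for s in is_case:
            freq = 0.45 if (v == causal and s) else 0.20
            column.append(rng.random() < freq)
        variants.append(column)
    return variants, is_case


variants, is_case = simulate()

# Hypothesis-driven: test only a short list of candidate variants chosen up front.
candidates = [10, 20, 30]  # hypothetical candidates; the causal variant is not among them
candidate_hits = [
    v for v in candidates
    if association_p_value(variants[v], is_case) < 0.05 / len(candidates)
]

# Hypothesis-free: test every variant and correct for the number of tests.
scan_hits = [
    v for v in range(len(variants))
    if association_p_value(variants[v], is_case) < 0.05 / len(variants)
]

print("candidate study hits:", candidate_hits)
print("genome-wide scan hits:", scan_hits)
```

The candidate study can only ever find what was on its list, whereas the scan pays a multiple-testing penalty but can turn up an association nobody thought to look for.  The same variants table could be reused unchanged with a different is_case vector, which is the reuse point made above.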

So perhaps the criterion is whether your questioning of the data is hypothesis-driven in the traditional sense.


Dr Tim Cutts
Acting Head of Scientific Computing
Wellcome Trust Sanger Institute

 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 
