[Beowulf] Slide on big data

Mark Hahn hahn at mcmaster.ca
Tue Feb 18 22:58:10 PST 2014


> Pardon me, what exactly IS Big Data :)

ya smiley, but I think it's worth trying to put words to it.

mostly, I think BD is really "Many Data": it's not really
about the absolute scale.  If I run a big simulation that 
writes 10 TB of checkpoints every cycle, that's reasonably
large data.  in a sense, I've got just one unit of data per node,
so not really "many".  Or if I'm doing lookups in some giant
business DB - the tables may be quite large, but I'm probably
doing low-cardinality selects and joins (indices FTW!).

in a sense, you have BD when your data and performance control
the design of your clusters.  you may have a very trad DB that's
implemented across more than one node, but it's probably not
a thousand nodes with gigabit ethernet - the latter is probably BD.

I often think of BD and Data Mining as being quite closely linked.
But I don't think I'd want to say that all BD is for DM...

regards, mark hahn.


