[Beowulf] hadoop

Tue Nov 27 08:34:12 PST 2012

On Tue, Nov 27, 2012 at 11:13:25AM -0500, Ellis H. Wilson III wrote:

> Are these problems EP such that they could be entirely Map tasks? 

Not at all. This particular application is to derive optimal
feature extraction algorithms from high-resolution volumetric data
(mammal or primate connectome). At ~8 nm resolution, even a mouse
brain will produce a mountain of structural data.

> Because otherwise you are going to have a fairly significant shuffle 
> stage in your MapReduce application that will lead to overheads moving 
> the data over the network and in and out of memory/disk/etc.  Shuffling 
> can be a real PITA, but it tends to be present in most real-world 
> applications I've run into.

The extracted feature set would be much more compact than the
raw dataset (by a factor of at least 10^3, up to 10^6), and
could be loaded over the GBit/s network into the main cluster
with no problems.
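
A rough sketch of the arithmetic, in case it helps to put numbers
on this. Everything here is an assumption for illustration, not a
measured figure: a mouse brain volume of about 500 mm^3, isotropic
8 nm voxels, 1 byte per voxel, and a link running at its nominal
1 Gbit/s with no protocol overhead. The function names are made up
for this example. Under those assumptions the raw volume comes out
on the order of 10^18 bytes, and the script prints the feature set
size and transfer time at both ends of the compaction range.

```python
# Back-of-envelope sketch; every constant below is an assumption,
# not a figure taken from the thread.
NM_PER_MM = 1_000_000  # nanometres per millimetre


def raw_bytes(volume_mm3, voxel_nm, bytes_per_voxel=1):
    """Raw dataset size for an isotropically sampled volume."""
    voxels_per_mm3 = (NM_PER_MM / voxel_nm) ** 3
    return volume_mm3 * voxels_per_mm3 * bytes_per_voxel


def transfer_seconds(n_bytes, link_bits_per_s=1e9):
    """Time to move n_bytes at the nominal line rate (no overhead)."""
    return n_bytes * 8 / link_bits_per_s


if __name__ == "__main__":
    # Assumed ~500 mm^3 brain volume sampled at 8 nm, 1 byte/voxel.
    raw = raw_bytes(volume_mm3=500, voxel_nm=8)
    print(f"raw dataset: {raw:.2e} bytes")

    # Compaction factors from the 10^3 to 10^6 range quoted above.
    for factor in (1e3, 1e6):
        features = raw / factor
        hours = transfer_seconds(features) / 3600
        print(f"compaction {factor:.0e}: {features:.2e} bytes, "
              f"{hours:.1f} h at 1 Gbit/s")
```

The transfer does not have to happen in one go: features can be
shipped as they are extracted, so the time that matters is the
extraction rate against the link rate, not the total.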

> Maybe you weren't referring to using Hadoop, in which case this 
> basically looks just like the FAWN project I had mentioned in the past 
> that came out of CMU (with the addition of tiered storage).

http://www.cs.cmu.edu/~fawnproj/ ?

Cute, and probably the right application for the
Adapteva project. If the boards are credit-card
sized you can mount them on a rackmount tray
along with a 24-port switch and a couple of
fans.

However, I'm thinking about a board you directly plug
your SATA or SAS hard drive into, probably using
the hard drive itself (which should then be a 5k rpm
model) as a heatsink.