[Beowulf] hadoop

Eugen Leitl eugen at leitl.org
Tue Nov 27 09:03:25 PST 2012


On Tue, Nov 27, 2012 at 11:50:00AM -0500, Ellis H. Wilson III wrote:

> > Not at all. This particular application is to derive optimal
> > feature extraction algorithms from high-resolution volumetric data
> > (mammal or primate connectome). At ~8 nm, even a mouse will
> > produce a mountain of structural data.
> 
> Pardon my possible naiveté on the applied science here, but it's unclear 
> to me why the state space explosion is tied to it being embarrassingly 
> parallel or not.  Perhaps, to reword my question, can you describe if, 

The problem is the huge amount of data. You need to be able to process
10-100 PByte data sets within reasonable times -- say, overnight.
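
Back of the envelope, with my own assumptions (a ~10 hour window,
~150 MByte/s sustained sequential read per spindle), the aggregate
bandwidth and spindle count needed just to stream the data once:

# rough throughput estimate; window length and per-spindle rate
# are assumptions, not measurements
dataset_bytes = 10e15          # 10 PByte, the low end of 10-100 PB
window_s = 10 * 3600           # "overnight", ~10 hours
per_spindle = 150e6            # bytes/s sustained sequential read

aggregate = dataset_bytes / window_s        # ~280 GB/s
spindles = aggregate / per_spindle          # ~1850 spindles
print("%.0f GB/s aggregate, ~%d spindles" % (aggregate / 1e9, spindles))

At the 100 PByte end you are looking at roughly ten times that,
which is why the processing has to sit next to each spindle.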

> and if, at what frequency, the extraction algorithms will need to 
> barrier sync to communicate?  If this is indeed "not at all" EP, then 

You will only need to communicate the surfaces of adjacent
volumetric subsets. Because a cube's surface grows as the square
of its edge while its volume grows as the cube, the total
information exchanged is negligible, and it stays local to
neighboring blocks.
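
For a hypothetical block size the ratio is easy to see: if a node
owns a cube of 4096^3 voxels (8-bit grayscale assumed) and exchanges
a one-voxel-thick face with each of its six neighbors:

# halo traffic vs. block volume for a cubic partition; block size
# and voxel depth are illustrative assumptions
n = 4096                        # voxels per edge of a node's block
bytes_per_voxel = 1             # 8-bit grayscale EM data
volume = n**3 * bytes_per_voxel             # ~69 GB per block
halo = 6 * n**2 * bytes_per_voxel           # ~100 MB for all six faces
print("halo/volume = %.2f%%" % (100.0 * halo / volume))   # ~0.15%

Even 1GbE moves that exchange in a couple of seconds.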

> you will likely have a serious communication problem and 1GbE will not 
> work if you need to transmit some or all of the data you are reading 
> locally to some other remote node.

No, the CPU will be essentially 100% occupied processing a data
stream coming in at 100-200 MByte/s from each spindle.
 
> How are you getting the raw data onto the cluster?  This time may become 

The data would come from a scanning instrument, or rather, from
multiple scanning instruments working in parallel on presectioned
blocks of tissue. Most likely serial block-face scanning electron
microscopy
http://en.wikipedia.org/wiki/Serial_block-face_scanning_electron_microscopy
probably combined with focused ion beam milling
http://www.fei.com/uploadedfiles/documents/content/serial-section-sem-fib.pdf
Acquisition can probably run 1-3 years before you have the full
data set from a cm^3 sample.
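
Taking those figures at face value (call it 100 PByte over two
years), the sustained ingest rate the storage has to absorb is
modest compared to the processing bandwidth above:

# implied sustained acquisition rate; 100 PB is the high end of the
# data range, 2 years the middle of the 1-3 year estimate
dataset_bytes = 100e15
seconds = 2 * 365 * 24 * 3600
print("~%.1f GB/s sustained ingest" % (dataset_bytes / seconds / 1e9))  # ~1.6 GB/s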

> the dominant one if it is not a write-once read-very-many type of 

It's exactly read-mostly: only the results will be written to a
scratch area of the disk, and from there they can be uploaded to
the main cluster.

> situation.  Maybe you have lots of different feature extraction 
> algorithms to use on that raw data?

Exactly. Tuning the parameters for feature extraction is a black
art, so there will be plenty of experimental passes over the same
raw data.
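
The workload shape is therefore: read the same local chunks over
and over with different algorithms and parameters, and write only
tiny result records. A minimal sketch, all names hypothetical:

import itertools, json

# hypothetical parameter sweep over one node's local chunks;
# extract() stands in for a single feature-extraction pass
thresholds = [0.4, 0.5, 0.6]
kernels_nm = [8, 16, 32]
chunks = ["/data/block_%04d.raw" % i for i in range(64)]

def extract(path, threshold, kernel):
    # placeholder: stream the chunk, return only a small feature summary
    return {"chunk": path, "t": threshold, "k": kernel, "n_features": 0}

with open("/scratch/results.jsonl", "w") as out:
    for t, k in itertools.product(thresholds, kernels_nm):
        for chunk in chunks:
            # the raw data is re-read for every parameter combination;
            # only the small result goes to local scratch for later upload
            out.write(json.dumps(extract(chunk, t, k)) + "\n")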
 
> >> Maybe you weren't referring to using Hadoop, in which case this
> >> basically looks just like the FAWN project I had mentioned in the past
> >> that came out of CMU (with the addition of tiered storage).
> >
> > http://www.cs.cmu.edu/~fawnproj/ ?
> 
> Yep, that's the one.
> 
> > Cute, and probably the right application for the
> > Adapteva project. If the boards are credit-card
> > sized you can mount them on a rackmount tray
> > along with a 24-port switch, with a couple of
> > fans.
> >
> > However, I'm thinking about a board you directly plug
> > your SATA or SAS hard drive into, probably using
> > the hard drive itself (which should be 5k rpm then)
> > as a heatsink.
> 
> Why do you want the HDD to be a heatsink (i.e. why is that better in any 
> way than just having the HDD right there and using a normal passive 

Because you want to put a PByte in a rack, and also be able to
process it where it is. That leaves almost no space for extra
hardware, which would also impede the airflow past the vertically
arranged drives, like
http://www.thattommyhall.com/wp-content/uploads/2008/04/thumper2.jpg
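
For scale, with Thumper-style density and a drive size that is my
assumption rather than a spec:

# rack capacity estimate; 48 drives per 4U and 4 TB drives are assumptions
drives_per_4u = 48
usable_u = 40
drive_tb = 4
drives = drives_per_4u * (usable_u // 4)    # 480 drives per rack
print("%d drives, ~%.1f PB raw" % (drives, drives * drive_tb / 1000.0))  # ~1.9 PB

So essentially every slot has to be a drive, and whatever compute
there is has to fit into the same airflow path.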

> sink)?  And can you expound upon the differences between the FAWN setup 
> if it had a HDD saddled right next to it against what you are 
> describing?  I feel like you're saying the exact same thing except just 
> connect a HDD for capacity reasons and use the onboard flash for cache 
> instead, both of which are reasonably trivial.
> 
> Just trying to get a handle on your (interesting IMHO) idea here, no 
> non-constructive criticism intended,


