[Beowulf] PetaBytes on a budget, take 2

Thu Jul 21 11:55:30 PDT 2011

On 07/21/11 14:29, Greg Lindahl wrote:
> On Thu, Jul 21, 2011 at 12:28:00PM -0400, Ellis H. Wilson III wrote:
> 
>>  For traditional Beowulfers, spending a year or two developing custom
>> software just to manage big data is likely not worth it.
> 
> There are many open-source packages for big data, HDFS being one
> file-oriented example in the Hadoop family. While they generally don't
> have the features you'd want for running with HPC programs, they do
> have sufficient features to do things like backups.

I'm actually doing a bunch of work with Hadoop right now, so it's funny
you mention it.  My experience with and understanding of Hadoop/HDFS is
that it is really geared more towards actually doing something with the
data once you have it on storage, which is why it's based on the Google
File System (and undoubtedly why you mention it, being in the search arena
yourself).  As purely a backup solution it would be particularly clunky,
especially in a setup like this one where there's a high HDD to CPU ratio.

My personal experience with getting large amounts of data from local
storage to HDFS has been suboptimal compared to something more raw, but
perhaps I'm doing something wrong.  Do you know of any distributed
file systems that are geared towards high sequential performance and
resilient backup/restore?  I think even for HPC (checkpoints), there's a
real need to be able to push massive data down and get it back
over wide pipes.  Perhaps pNFS will fill this need?

ellis