[Beowulf] Torrents for HPC
Ellis H. Wilson III
ellis at cse.psu.edu
Tue Jun 12 10:56:22 PDT 2012
On 06/08/12 20:06, Bill Broadley wrote:
> A new user on one of my GigE clusters submits batches of 500 jobs that
> need to randomly read a 30-60GB dataset. They aren't the only user of
> said cluster so each job will be waiting in the queue with a mix of others.
With a 160TB cluster and only a 30-60GB dataset, is there any reason why
the user isn't simply storing their dataset in HDFS? Does the data
change frequently via a non-MapReduce framework such that it needs to be
pulled from NFS before every job? If the dataset is in a few dozen
files and resides in HDFS on the cluster, there is no reason why
MapReduce shouldn't spawn its tasks directly "on" the data, without
needing (most of the time) to move all of the data to every node as you
mention.
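To see why locality usually works out, here is a minimal sketch (hypothetical node names and cluster size; 64 MB blocks and 3x replication, which were HDFS's defaults around that time) of how a 60GB dataset shatters into blocks, each replicated on several nodes, so nearly every block has multiple candidates for a local read:

```python
import random

BLOCK_SIZE = 64 * 2**20          # HDFS default block size circa 2012: 64 MB
REPLICATION = 3                  # HDFS default replication factor
DATASET_BYTES = 60 * 2**30       # the 60 GB dataset from the original post
NODES = ["node%03d" % i for i in range(40)]  # hypothetical 40-node cluster

def place_blocks(dataset_bytes, nodes, replication, seed=0):
    """Assign each block to `replication` distinct nodes, HDFS-style."""
    rng = random.Random(seed)
    n_blocks = -(-dataset_bytes // BLOCK_SIZE)   # ceiling division
    return {b: rng.sample(nodes, replication) for b in range(n_blocks)}

placement = place_blocks(DATASET_BYTES, NODES, REPLICATION)
# 960 blocks, each with 3 candidate nodes where a map task can read locally.
```

This is only a model of the placement, not HDFS itself, but it shows the key property: the scheduler has three choices of node per block, so packing tasks next to their data is usually easy.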
> The clients definitely see MUCH faster performance when accessing a
> local copy instead of a small share of the performance/bandwidth of a
> central file server.
This makes perfect sense, and is in fact exactly what Hadoop already
attempts to do by trying to co-locate MapReduce tasks with pre-placed
data in HDFS. Hadoop tries to move the computation to the data in this
case, rather than what you are trying to do: move the data to the
computation, which tends to be /way/ harder unless you've got a killer
interconnect.
All of this said, it is unclear from your email whether this user is
using Hadoop or if that was just a side-note and they are operating in a
totally different cluster with a different framework (MPI?).