[Beowulf] Torrents for HPC

Skylar Thompson skylar.thompson at gmail.com
Mon Jun 11 17:34:35 PDT 2012


On 6/8/2012 5:06 PM, Bill Broadley wrote:
> 
> I've built Myrinet, SDR, DDR, and QDR clusters (no FDR yet), but I
> still have users whose use cases and budgets only justify GigE.
> 
> I've set up a 160TB Hadoop cluster that is working well, but I haven't
> found justification for the complexity/cost of Lustre.  I have high
> hopes for Ceph, but it seems not quite ready yet.  I'd be happy to hear
> otherwise.
> 
> A new user on one of my GigE clusters submits batches of 500 jobs that 
> need to randomly read a 30-60GB dataset.  They aren't the only user of 
> said cluster, so each job waits in the queue with a mix of others.
> 
> As you might imagine, that hammers a central GigE-connected NFS server 
> pretty hard.  This cluster has 38 compute nodes/304 cores/608 threads.
> 
> I thought BitTorrent might be a good way to publish such a dataset to 
> the compute nodes (thus avoiding the GigE bottleneck).  So I wrote a 
> small/simple BitTorrent client, made a 16GB example dataset, and 
> measured the performance pushing it to 38 compute nodes:
>      http://cse.ucdavis.edu/bill/btbench-2.png
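> 
> For the curious, here's a minimal sketch of building and seeding such a
> torrent, assuming the python-libtorrent bindings; my actual client is
> home-grown and not shown here, and the paths and tracker URL below are
> just placeholders:
> 
>     # Hypothetical sketch: build a .torrent for a dataset directory and
>     # seed it from the file server, using python-libtorrent.
>     import time
>     import libtorrent as lt
> 
>     dataset = "/data/example-16GB"              # placeholder directory
>     tracker = "http://head-node:6969/announce"  # placeholder tracker
> 
>     fs = lt.file_storage()
>     lt.add_files(fs, dataset)            # enumerate files to publish
>     t = lt.create_torrent(fs)
>     t.add_tracker(tracker)
>     lt.set_piece_hashes(t, "/data")      # hash pieces; parent of dataset
>     with open("example.torrent", "wb") as f:
>         f.write(lt.bencode(t.generate()))
> 
>     # Seed: the data is already complete under save_path, so after the
>     # recheck this session just serves pieces to the compute nodes.
>     ses = lt.session()
>     ses.add_torrent({"ti": lt.torrent_info("example.torrent"),
>                      "save_path": "/data"})
>     while True:
>         time.sleep(60)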
> 
> The slow ramp-up is partly because I'm launching torrent clients with 
> a crude "for i in <compute_nodes>; do ssh $i launch_torrent.sh; done".
> 
> I get approximately 2.5GB/sec sustained when writing to 38 compute 
> nodes.  So 38 nodes * 16GB = 608GB to distribute @ 2.5 GB/sec = 240 
> seconds or so.
> 
> The clients definitely see MUCH faster performance when accessing a 
> local copy instead of a small share of the bandwidth of a central file 
> server.
> 
> Do you think it's worth bundling up for others to use?
> 
> This is how it works:
> 1) User runs publish <directory> <name> before they start submitting
>     jobs.
> 2) The publish command makes a torrent of that directory and starts
>     seeding that torrent.
> 3) The user submits an arbitrary number of jobs that need that
>     directory.  Inside the job they run "$ subscribe <name>".
> 4) The subscribe command launches one torrent client per node (not per
>     job) and blocks until the directory is completely downloaded (a
>     sketch of this per-node step follows the list).
> 5) /scratch/<user>/<name> then holds the user's data.
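> 
> A minimal sketch of the per-node part of subscribe, assuming Python;
> run_torrent_client() below is a placeholder for the real client, and
> the lock/marker layout is just one way to do it:
> 
>     # Hypothetical sketch: many jobs can land on the same node, but only
>     # one of them should run a torrent client; the rest wait for a
>     # "complete" marker before using /scratch/<user>/<name>.
>     import fcntl, os, time
> 
>     def subscribe(user, name):
>         dest = "/scratch/%s/%s" % (user, name)
>         marker = dest + ".complete"
>         try:
>             os.makedirs(os.path.dirname(dest))
>         except OSError:
>             pass                 # already exists
>         with open(dest + ".lock", "w") as lock:
>             try:
>                 # First job on the node wins the lock and downloads.
>                 fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
>                 if not os.path.exists(marker):
>                     run_torrent_client(name, dest)   # placeholder
>                     open(marker, "w").close()
>             except IOError:
>                 # Another job on this node is downloading; wait for it.
>                 while not os.path.exists(marker):
>                     time.sleep(10)
>         return dest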
> 
> Not nearly as convenient as having a fast parallel filesystem, but it 
> seems potentially useful for those with large read-only datasets, GigE, 
> and NFS.
> 
> Thoughts?

We've run into a similar need at $WORK. I work in a large genomics
research department, and we have cluster users who want to copy large
data files (20GB-500GB) to hundreds of cluster nodes at once.

Since the people who need this tend to run MPI anyway, I wrote an MPI
utility that copies a file once to every node in the job, taking care to
make sure each node gets only one copy of the file and to copy the file
only if its SHA1 hash differs from the copy already on the node.
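
Roughly, the idea in mpi4py terms (a sketch only, not the actual utility,
which differs in language and detail): one rank per node acts as a leader,
leaders compare the SHA1 of any existing local copy against the source
file's, and only leaders whose copy is missing or stale receive the data.

    # Hypothetical sketch: copy a file once per node, and only if its
    # SHA1 differs from whatever is already on the node.
    import hashlib, os, socket, sys
    from mpi4py import MPI

    def sha1(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def copy_per_node(src, dest):
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        # One leader per node: the lowest global rank on each hostname.
        # Rank 0 is always the leader of its own node.
        hosts = comm.allgather(socket.gethostname())
        leader = (rank == hosts.index(hosts[rank]))

        # Leaders form their own communicator; everyone else sits out.
        nodes = comm.Split(0 if leader else MPI.UNDEFINED, rank)
        if nodes != MPI.COMM_NULL:
            ref = nodes.bcast(sha1(src) if rank == 0 else None, root=0)
            stale = not os.path.exists(dest) or sha1(dest) != ref
            need = nodes.gather(stale, root=0)
            if rank == 0:
                # Whole file in memory for brevity; a real tool would
                # stream it in chunks.
                data = open(src, "rb").read()
                for peer, wants in enumerate(need):
                    if wants and peer != 0:
                        nodes.send(data, dest=peer)
                if stale:
                    open(dest, "wb").write(data)
            elif stale:
                open(dest, "wb").write(nodes.recv(source=0))
            nodes.Free()
        comm.Barrier()   # no rank returns before its node has the file

    if __name__ == "__main__":
        copy_per_node(sys.argv[1], sys.argv[2])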

Skylar


