[Beowulf] Torrents for HPC

Bill Broadley bill at cse.ucdavis.edu
Fri Jun 8 17:06:19 PDT 2012


I've built Myrinet, SDR, DDR, and QDR clusters (no FDR yet), but I 
still have users whose use cases and budgets only justify GigE.

I've set up a 160TB Hadoop cluster that is working well, but I haven't 
found justification for the complexity/cost of Lustre.  I have high 
hopes for Ceph, but it doesn't seem quite ready yet.  I'd be happy to 
hear otherwise.

A new user on one of my GigE clusters submits batches of 500 jobs that 
need to randomly read a 30-60GB dataset.  They aren't the only user of 
said cluster, so each job sits in the queue with a mix of others.

As you might imagine, that hammers a central GigE-connected NFS server 
pretty hard.  This cluster has 38 compute nodes / 304 cores / 608 threads.

I thought BitTorrent might be a good way to publish such a dataset to 
the compute nodes (thus avoiding the GigE bottleneck).  So I wrote a 
small/simple BitTorrent client, made a 16GB example dataset, and 
measured the performance of pushing it to 38 compute nodes:
     http://cse.ucdavis.edu/bill/btbench-2.png

The slow ramp-up is partly because I'm launching the torrent clients 
serially with a crude loop:
     for i in <compute_nodes>; do ssh $i launch_torrent.sh; done

I get approximately 2.5GB/sec sustained aggregate when writing to 38 
compute nodes.  So 38 nodes * 16GB = 608GB to distribute @ 2.5 GB/sec 
= 240 seconds or so.

The clients definitely see MUCH faster performance when accessing a 
local copy instead of a small share of the bandwidth of a central 
file server.

Do you think it's worth bundling up for others to use?

This is how it works (a rough sketch of the wrapper scripts follows 
the list):
1) The user runs publish <directory> <name> before they start
    submitting jobs.
2) The publish command makes a torrent of that directory and starts
    seeding that torrent.
3) The user submits an arbitrary number of jobs that need that
    directory.  Inside each job they run "$ subscribe <name>".
4) The subscribe command launches one torrent client per node (not per
    job) and blocks until the directory is completely downloaded.
5) /scratch/<user>/<name> then has the user's data.
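
To make that concrete, here's roughly what the two wrappers boil down 
to.  bt_seed and bt_fetch are stand-ins for my homegrown client (any 
client that can seed, or exit once a download completes, would do), 
mktorrent is just one way to create the .torrent file, and the tracker 
URL and lock/sentinel paths are made up:

     #!/bin/bash
     # publish <directory> <name>: create a torrent for <directory> and
     # start seeding it from the submit host.
     dir="$1"; name="$2"
     mktorrent -a http://head-node:6969/announce -o "$name.torrent" "$dir"
     bt_seed "$name.torrent" "$dir" &

     #!/bin/bash
     # subscribe <name>: make sure exactly one torrent client runs per
     # node, and block every job on the node until the data is complete.
     name="$1"
     dest="/scratch/$USER/$name"
     done_flag="$dest/.complete"        # sentinel written after a full download
     lock="/tmp/subscribe-$USER-$name.lock"

     mkdir -p "$dest"
     (
         flock -x 9                     # only one job per node gets past this
         if [ ! -e "$done_flag" ]; then
             bt_fetch "$name" "$dest"   # exits when the download finishes
             touch "$done_flag"
         fi
     ) 9>"$lock"
     # Later jobs on the same node block on the flock above, then fall
     # through here once the first one has finished downloading.

The flock is what makes "one client per node" work: the first job to 
arrive does the download, and every later job on that node just waits 
on the lock and then finds the sentinel already present.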

Not nearly as convenient as having a fast parallel filesystem, but it 
seems potentially useful for those who have large read-only datasets, 
GigE, and NFS.

Thoughts?



