[Beowulf] fast file copying
Many of your questions may have already been answered in earlier discussions or in the FAQ. The search results page will indicate current discussions as well as past list serves, articles, and papers.
Bill Broadley bill at cse.ucdavis.eduFri May 4 01:48:50 PDT 2007
- Previous message: [Beowulf] compiler/hardware issue fixed
- Next message: [Beowulf] fast file copying
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Geoff Galitz wrote: > > Hi folks, > > During an HPC talk some years ago, I recall someone mentioned a tool > which can copy large datasets across a cluster using a ring topology. > Perhaps someone here knows of this tool? Not sure about a ring topology, seems kinda silly... why not bit-torrent? It's opensource, extremely common, and already integrated into at least one cluster distribution. There's a zillion implementation, your favorite language is likely to have a few (at least for python, c, and java). I've installed a 170+ node rocks cluster in 10 minutes or so, the RPMs are distributed by bit-torrent so that it doesn't matter if one node dies as part of the install. Nor does it matter if your node list has some strange mapping to your physical network (which is often what you get when you ask a batch queue for 5% of a cluster). > More to the point, we are pushing around datasets that are about > 1Gbyte. The datasets are pushed out to dozens of nodes all at once and How often? I just bit-torrented a 1GB file to 165 nodes in 3 minutes, 1.5 minutes was the lazy why I launched it (the last node didn't start until 1.5 minutes into the run). BTW, 140 or so of those nodes already had 1 job per CPU running. > we foresee saturating the I/O system on our cluster as we grow. We are > limited to using just the available disks and are looking for a > reasonable solution that can support this kind of simultaneous access. There are various ways to maximize I/O with bit-torrent. Various seeders allow uploading each block only once (usually called super seeder mode). Assuming you have a few GB ram on the file server you could even prefetch the file before torrenting (i.e. dd if=file_to_server of=/dev/null) since the limit on bit-torrent bandwidth is often how quickly you can seek. Additionally you can make the chunk size larger to reduce the number of seeks. On the client side preallocation can greatly reduce the number of seeks. > Currently we push the data out using rsync, but if I don't get any > better ideas I may simply move to a pull system where the data is > fetched by HTTP. I can get better throttling that way, at least. > If you have a low churn rate you could generate a diff (with rsync) and distribute that via bit-torrent. What kind of per node bandwidths are you hoping for? 1GB sounds really easy unless you have to do it rather often.
- Previous message: [Beowulf] compiler/hardware issue fixed
- Next message: [Beowulf] fast file copying
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Beowulf mailing list
