[Beowulf] copying big files (Henning Fehrmann)
henning.fehrmann at aei.mpg.de
Mon Aug 11 09:45:02 PDT 2008
I found some time to play with dolly and nettee. They do what I was
looking for. Thank you for the hints.
> > I will say that my dream would be for something like dolly to get some
> > sort of transfer recovery mechanism, though I realize that would be quite
> > difficult in such a topology.
> nettee has some failover and continuation capabilities at different
> points - but not what I think you want. The development version has a
> few extra modes for cases where data is being merged, but that isn't
> relevant to this discussion. When setting up the initial chain nettee
> can connect to an alternate node (from a list of failovers) if the
> target node will not answer. It also has the ability to keep going if
> the local disk becomes unwritable, and it can continue a download on a
> chain down to the node above the point of failure.
> However, nettee cannot at present rewire around a failed node to
> continue a download to the node(s) below it. That would indeed be quite
> difficult, since one could have a situation like this:
> A -> B (A knows it has sent 100MB)
> B -> C (B knows it has sent 98MB, then it blows up)
> C (C knows it has received 98 MB)
> A and C will eventually figure out that B has died, and they could
> conceivably negotiate a new connection, but A may no longer have the
> missing 2 MB (it might have been sent out a pipe, processed, and not
> stored in the raw state anywhere.) On the other hand, the development
> version uses ring buffers, and one could set those to be very large,
> enabling a certain level of "redo" from A. So if C comes back and says
> "I only have 98MB" A can see if it has the missing parts and go on if it
> does. It still might not though. If B has stalled for long enough
> the ring buffer on A may have completely filled from the previous node,
> overwriting the data needed to recover. I guess it would be possible to
> implement a "safety region" in the ring buffer which could not be
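To make the ring-buffer "redo" idea quoted above concrete, here is a minimal sketch in Python. It is not nettee's code; the class and method names (ReplayRing, replay_from, oldest_offset) are made up for illustration. The only assumption is that the sender counts the absolute number of bytes it has pushed through the buffer, so that when C reports "I only have N bytes" A can check whether everything from offset N onward is still in memory.

```python
class ReplayRing:
    """Hypothetical fixed-size ring buffer that remembers absolute stream offsets."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = bytearray(capacity)
        self.total = 0  # total bytes ever written = absolute offset of the stream end

    def write(self, data):
        # Byte-by-byte for clarity, not speed; old data is silently overwritten.
        for b in data:
            self.buf[self.total % self.capacity] = b
            self.total += 1

    def oldest_offset(self):
        # Lowest absolute offset that has not yet been overwritten.
        return max(0, self.total - self.capacity)

    def replay_from(self, offset):
        """Return the bytes from `offset` to the end of the stream,
        or None if part of that range has already been overwritten."""
        if offset < self.oldest_offset() or offset > self.total:
            return None
        return bytes(self.buf[i % self.capacity] for i in range(offset, self.total))


if __name__ == "__main__":
    ring = ReplayRing(capacity=4)
    ring.write(b"abcdef")          # the sender has pushed 6 bytes in total
    print(ring.replay_from(4))     # the receiver has 4 bytes: the rest can be resent
    print(ring.replay_from(1))     # the receiver has 1 byte: the data is gone, prints None
```

A "safety region" as mentioned in the quote would presumably mean that write() blocks or refuses instead of overwriting bytes the downstream node has not yet acknowledged; the sketch leaves that out.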
I successfully distributed a 10G file to 50 nodes. The rate was 140Mb/s for nettee and a bit slower with dolly.
I guess that was due to a busy node somewhere in the chain.
Increasing the number of clients to 100 failed in both cases.
For nettee I got:
nettee: fatal error writing to child: Connection reset by peer
Sent MB: 40, MB/s: 66.752, Current MB/s: 35.710 movebytes
reset by peer
errno = 104
I will do more systematic tests over the next few days.
David Mathog, are you interested in bug reports?