[Beowulf] copying big files (Henning Fehrmann)
Henning Fehrmann henning.fehrmann at aei.mpg.de
Mon Aug 11 09:45:02 PDT 2008
Hi,
I found some time to play with dolly and nettee. They do what I was
looking for. Thank you for the hints.
> > I will say that my dream would be for something like dolly to get some
> > sort of transfer recovery mechanism, though I realize that would be quite
> > difficult in such a topology.
>
> nettee has some failover and continuation capabilities at different
> points - but not what I think you want. The development version has a
> few extra modes for cases where data is being merged, but that isn't
> relevant to this discussion. When setting up the initial chain nettee
> can connect to an alternate node (from a list of failovers) if the
> target node will not answer. It also has the ability to keep going if
> the local disk becomes unwritable, and it can continue a download on a
> chain down to the node above the point of failure.
>
> However, nettee cannot at present rewire around a failed node to
> continue a download to the node(s) below it. That would indeed be quite
> difficult, since one could have a situation like this:
>
> A -> B (A knows it has sent 100MB)
> B -> C (B knows it has sent 98MB, then it blows up)
> C (C knows it has received 98 MB)
>
> A and C will eventually figure out that B has died, and they could
> conceivably negotiate a new connection, but A may no longer have the
> missing 2 MB (it might have been sent out a pipe, processed, and not
> stored in the raw state anywhere.) On the other hand, the development
> version uses ring buffers, and one could set those to be very large,
> enabling a certain level of "redo" from A. So if C comes back and says
> "I only have 98MB" A can see if it has the missing parts and go on if it
> does. It still might not though. If B has stalled for long enough
> the ring buffer on A may have completely filled from the previous node,
> overwriting the data needed to recover. I guess it would be possible to
> implement a "safety region" in the ring buffer which could not be
> overwritten.
>
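To make the "safety region" idea above concrete, here is a minimal sketch in Python. It is not nettee's implementation; the class name SafetyRing and its methods are hypothetical, and the sizes are scaled down from MB to single bytes. It assumes the node tracks two absolute offsets (bytes accepted from upstream, bytes forwarded downstream) and refuses to overwrite the last `safety` bytes it has already forwarded.

```python
class SafetyRing:
    """Hypothetical ring buffer that keeps a replay window behind the send point."""

    def __init__(self, capacity, safety):
        if not 0 <= safety < capacity:
            raise ValueError("safety must be smaller than capacity")
        self.capacity = capacity
        self.safety = safety      # bytes behind `sent` that may not be overwritten
        self.buf = bytearray(capacity)
        self.written = 0          # total bytes accepted from the upstream node
        self.sent = 0             # total bytes forwarded to the downstream node

    def free_space(self):
        # Everything from (sent - safety) up to written must stay in the buffer.
        protected_from = max(0, self.sent - self.safety)
        return max(0, self.capacity - (self.written - protected_from))

    def write(self, data):
        """Accept as much of data as fits; return the number of bytes taken."""
        n = min(len(data), self.free_space())
        for i in range(n):
            self.buf[(self.written + i) % self.capacity] = data[i]
        self.written += n
        return n

    def read(self, n):
        """Return up to n unsent bytes and advance the send point."""
        end = min(self.sent + n, self.written)
        out = bytes(self.buf[i % self.capacity] for i in range(self.sent, end))
        self.sent = end
        return out

    def rewind(self, offset):
        """Move the send point back to offset if those bytes still exist."""
        oldest_present = max(0, self.written - self.capacity)
        if oldest_present <= offset <= self.sent:
            self.sent = offset
            return True
        return False


# A forwards 100 bytes to B in small chunks, then C reports it only has 98.
ring = SafetyRing(capacity=8, safety=2)
payload = bytes(range(100))
pos = 0
while pos < len(payload):
    taken = ring.write(payload[pos:pos + 3])
    pos += taken
    ring.read(taken)              # forward everything to B immediately
recovered = ring.rewind(98) and ring.read(2)
```

Here `recovered` holds the two bytes C is missing, because they fall inside the protected region. A request to rewind further back than the buffer still holds returns False, which corresponds to the "it still might not" case above. The cost of the safety region is that a stalled downstream node blocks the upstream sender sooner.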
I successfully distributed a 10G file to 50 nodes. The rate was 140Mb/s for nettee and a bit slower with dolly.
I guess this was due to a busy node somewhere in the chain.
Increasing the number of clients to 100 failed in both cases.
For nettee I got:
nettee: fatal error writing to child: Connection reset by peer
For dolly I got:
Sent MB: 40, MB/s: 66.752, Current MB/s: 35.710
movebytes read/write: Connection reset by peer
errno = 104
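Both tools appear to report the same condition: on Linux, errno 104 is ECONNRESET, the code behind the "Connection reset by peer" message. A small Python sketch for decoding such a number on the local platform (the helper name describe_errno is hypothetical, and errno numbering differs between operating systems):

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message) for a numeric errno on this platform."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


# 104 is the value dolly printed; its meaning depends on the OS running this.
name, message = describe_errno(104)
print(name, "-", message)
```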
I will do more systematic tests over the next few days.
David Mathog, are you interested in bug reports?
Cheers,
Henning Fehrmann
