[Beowulf] copying big files (Henning Fehrmann)

Henning Fehrmann henning.fehrmann at aei.mpg.de
Mon Aug 11 09:45:02 PDT 2008


Hi,

I found some time to play with dolly and nettee. They do what I was
looking for. Thank you for the hints.

> > I will say that my dream would be for something like dolly to get some
> > sort of transfer recovery mechanism, though I realize that would be quite
> > difficult in such a topology.
> 
> nettee has some failover and continuation capabilities at different
> points - but not what I think you want. The development version has a
> few extra modes for cases where data is being merged, but that isn't
> relevant to this discussion. When setting up the initial chain nettee
> can connect to an alternate node (from a list of failovers) if the
> target node will not answer.  It also has the ability to keep going if
> the local disk becomes unwritable, and it can continue a download on a
> chain down to the node above the point of failure. 
> 
> However, nettee cannot at present rewire around a failed node to
> continue a download to the node(s) below it.  That would indeed be quite
> difficult, since one could have a situation like this:
> 
>   A -> B  (A knows it has sent 100MB)
>   B -> C  (B knows it has sent  98MB, then it blows up)
>   C       (C knows it has received 98 MB)
> 
> A and C will eventually figure out that B has died, and they could
> conceivably negotiate a new connection, but A may no longer have the
> missing 2 MB (it might have been sent out a pipe, processed, and not
> stored in the raw state anywhere.)  On the other hand, the development
> version uses ring buffers, and one could set those to be very large,
> enabling a certain level of "redo" from A.  So if C comes back and says
> "I only have 98MB" A can see if it has the missing parts and go on if it
> does.  It still might not though.  If B has stalled for long enough
> the ring buffer on A may have completely filled from the previous node,
> overwriting the data needed to recover.  I guess it would be possible to
> implement a "safety region" in the ring buffer which could not be
> overwritten.
> 
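
To illustrate the "safety region" idea quoted above, here is a minimal sketch in Python. It is not nettee's implementation; the class name ReplayRingBuffer, its methods, and the byte-offset bookkeeping are all assumptions made for illustration. The buffer refuses to overwrite the last `safety` bytes that were already sent downstream, so a node further down the chain can ask for a replay from an earlier offset.

```python
class ReplayRingBuffer:
    """Hypothetical byte ring buffer with a protected 'safety region'.

    Offsets are absolute stream positions. The last `safety` bytes
    already sent downstream are never overwritten by new input.
    """

    def __init__(self, capacity, safety):
        if not 0 <= safety <= capacity:
            raise ValueError("safety must be between 0 and capacity")
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.safety = safety
        self.written = 0  # total bytes accepted from the upstream node
        self.sent = 0     # total bytes handed to the downstream node

    def free_space(self):
        # Everything from (sent - safety) onward must stay intact.
        protected_from = max(0, self.sent - self.safety)
        return self.capacity - (self.written - protected_from)

    def write(self, data):
        """Accept as much of `data` as fits; return the byte count taken."""
        n = min(len(data), self.free_space())
        for i in range(n):
            self.buf[(self.written + i) % self.capacity] = data[i]
        self.written += n
        return n

    def _read_range(self, start, end):
        return bytes(self.buf[i % self.capacity] for i in range(start, end))

    def send(self, n):
        """Return up to `n` unsent bytes and advance the send position."""
        end = min(self.sent + n, self.written)
        chunk = self._read_range(self.sent, end)
        self.sent = end
        return chunk

    def replay_from(self, offset):
        """Return already-sent bytes from `offset`, or None if overwritten."""
        oldest = max(0, self.written - self.capacity)
        if offset < oldest or offset > self.sent:
            return None
        return self._read_range(offset, self.sent)


ring = ReplayRingBuffer(capacity=8, safety=2)
ring.write(b"abcdefgh")          # buffer is now full
ring.send(5)                     # "abcde" goes downstream
ring.write(b"ijklmnop")          # only 3 bytes fit; "de" stays protected
print(ring.replay_from(3))       # the protected bytes are still available
print(ring.replay_from(2))       # this offset has been overwritten
```

In this sketch a stalled downstream node eventually blocks the writer instead of losing the protected bytes, which is the trade-off a safety region implies: recovery is guaranteed only within `safety` bytes of the send position.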

I successfully distributed a 10G file to 50 nodes. The rate was 140Mb/s with
nettee and a bit slower with dolly, which I guess was due to a busy node
somewhere in the chain. Increasing the number of clients to 100 failed in
both cases.

For nettee I got:
nettee: fatal error writing to child: Connection reset by peer

For dolly:
Sent MB: 40, MB/s: 66.752, Current MB/s: 35.710      movebytes read/write: Connection reset by peer
         errno = 104

I will do more systematic tests over the next few days.
David Mathog, are you interested in bug reports?

Cheers,
Henning Fehrmann


