[Beowulf] copying big files (Henning Fehrmann)

David Mathog mathog at caltech.edu
Fri Aug 8 10:55:19 PDT 2008


> I will say that my dream would be for something like dolly to get some sort
> of transfer recovery mechanism, though I realize that would be quite
> difficult in such a topology.

nettee has some failover and continuation capabilities at different
points, but not what I think you want. The development version has a
few extra modes for cases where data is being merged, but those aren't
relevant to this discussion. When setting up the initial chain, nettee
can connect to an alternate node (from a list of failovers) if the
target node will not answer.  It can also keep going if the local disk
becomes unwritable, and it can continue a download on a chain down to
the node above the point of failure.
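
To make the failover-list idea concrete, here is a minimal sketch in
Python. It is not nettee's code (nettee's internals are not shown here);
the function name, the node names, and the injectable `connect` argument
are all assumptions made for illustration. It tries the intended target
first and then each alternate in order, returning the first node that
answers.

```python
import socket


def connect_with_failover(target, failovers, timeout=5.0,
                          connect=socket.create_connection):
    """Try `target`, then each address in `failovers`, in order.

    Addresses are (host, port) tuples. Returns (address, connection) for
    the first node that answers. `connect` is injectable so the logic can
    be exercised without a network; by default it opens a TCP connection.
    Hypothetical helper, not part of nettee.
    """
    last_error = None
    for address in [target, *failovers]:
        try:
            return address, connect(address, timeout=timeout)
        except OSError as exc:
            # Node did not answer (refused, timed out, unreachable):
            # remember why and move on to the next candidate.
            last_error = exc
    raise ConnectionError(f"no node answered; last error: {last_error}")
```

This only covers the initial chain setup described above. It does not
rewire a chain that is already transferring data.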

However, nettee cannot at present rewire around a failed node to
continue a download to the node(s) below it.  That would indeed be quite
difficult, since one could have a situation like this:

  A -> B  (A knows it has sent 100MB)
  B -> C  (B knows it has sent  98MB, then it blows up)
  C       (C knows it has received 98 MB)

A and C will eventually figure out that B has died, and they could
conceivably negotiate a new connection, but A may no longer have the
missing 2 MB (it might have been sent out a pipe, processed, and not
stored in the raw state anywhere.)  On the other hand, the development
version uses ring buffers, and one could set those to be very large,
enabling a certain level of "redo" from A.  So if C comes back and says
"I only have 98MB" A can see if it has the missing parts and go on if it
does.  It still might not, though: if B stalls for long enough, the
ring buffer on A may be completely refilled with data arriving from the
previous node, overwriting the data needed to recover.  I guess it would
be possible to implement a "safety region" in the ring buffer which
could not be overwritten.
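
A rough sketch of the "safety region" idea, in Python. This is an
illustration under stated assumptions, not nettee's ring buffer: the
class name, the byte-offset bookkeeping, and the `rewind` method are all
hypothetical. The buffer refuses to overwrite the last `safety` bytes
already sent downstream, so a node that reconnects and reports how many
bytes it holds can be resumed from that offset if the data is still
there.

```python
class RedoRingBuffer:
    """Ring buffer that protects the most recently sent bytes.

    Offsets are absolute positions in the stream. Hypothetical sketch.
    """

    def __init__(self, capacity, safety):
        if not 0 <= safety < capacity:
            raise ValueError("need 0 <= safety < capacity")
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.safety = safety
        self.written = 0  # offset of the next byte to arrive from upstream
        self.sent = 0     # offset of the next byte to send downstream

    def free_space(self):
        # Everything from (sent - safety) onward must be retained.
        keep_from = max(0, self.sent - self.safety)
        return max(0, self.capacity - (self.written - keep_from))

    def write(self, data):
        """Accept bytes from upstream; return how many were stored."""
        n = min(len(data), self.free_space())
        for i in range(n):
            self.buf[(self.written + i) % self.capacity] = data[i]
        self.written += n
        return n

    def read(self, n):
        """Return up to n unsent bytes and mark them as sent."""
        n = min(n, self.written - self.sent)
        out = bytes(self.buf[(self.sent + i) % self.capacity]
                    for i in range(n))
        self.sent += n
        return out

    def rewind(self, offset):
        """Resume sending from `offset` if those bytes are still held."""
        oldest_held = max(0, self.written - self.capacity)
        if oldest_held <= offset <= self.sent:
            self.sent = offset
            return True
        return False  # data already overwritten; cannot redo
```

In the A/B/C example, C would report its received byte count, A would
call `rewind` with that count, and the transfer continues only if it
returns True. Note that `write` accepting fewer bytes than offered means
A must stall its own upstream, which is the cost of the guarantee.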

> 
> As an aside, I know that the dolly author (Felix) reads this list.  I assume
> dolly itself is now unmaintained?

As far as I know, yes.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech


