PVFS fault tolerance?

Walter B. Ligon III walt at clemson.edu
Wed Feb 6 08:41:21 PST 2002


Ben, the advice given by Jeff Layton is sound.

An alternative if you want to use PVFS for long term storage is to use multiple
drives per IO node with redundancy on the IO node.  That way if you lose
a disk you don't lose data and don't lose availability.  Might also get
a little better performance.  If you lose a NODE (like a MB or NIC) then
the data becomes unavailable until you repair (or replace) the node but
the data is not lost.  This is not high availability by any means.

Another thing to consider is not striping all files across all IO nodes.
If you have many IO nodes you can stripe data to a subset of the nodes,
and randomly select which subset on a per file basis.  Thus when a node
goes down, not ALL files are affected - so you don't have to restore them
all while you wait to repair the node.
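
To make the subset idea concrete, here is a minimal sketch in Python.  It is
not PVFS code: the helper names (pick_io_nodes, affected_files), the node
count, and the stripe width are all assumptions chosen for illustration.  It
picks a random subset of IO nodes for each file and then counts how many
files have a stripe on a given failed node.

```python
import random

def pick_io_nodes(num_nodes, stripe_width, rng):
    """Choose a random subset of IO node indices for one file (hypothetical helper)."""
    return sorted(rng.sample(range(num_nodes), stripe_width))

def affected_files(layouts, failed_node):
    """Return the names of files that have a stripe on the failed node."""
    return [name for name, nodes in layouts.items() if failed_node in nodes]

rng = random.Random(0)  # fixed seed so the run is repeatable
NUM_NODES = 20          # assumed cluster size
STRIPE_WIDTH = 4        # assumed number of IO nodes per file

# Assign every file its own random subset of IO nodes.
layouts = {f"file{i}": pick_io_nodes(NUM_NODES, STRIPE_WIDTH, rng) for i in range(1000)}

# Pretend node 7 has failed and see which files are touched.
hit = affected_files(layouts, failed_node=7)
print(f"{len(hit)} of {len(layouts)} files need node 7")
```

On average a fraction STRIPE_WIDTH / NUM_NODES of the files is affected, here
about one in five, rather than all of them.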

We currently have in the works an on-top redundancy library for PVFS.
This is a parity-on-close scheme.  The library saves room in the file
for parity (using a RAID4 or RAID5 scheme) and when you close the file,
the parity is calculated and stored.  The parity is not used at all during
normal file reads and writes, so the effect on performance is minimal.
A utility is provided to recover a file if you lose a node by reading the
parity and rebuilding the lost blocks.  Because parity is only computed on
close, data being written at the time the node fails is not protected.  In
fact, any file open for writing at failure time is not protected - so you
should keep copies if you can.
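
For readers unfamiliar with single-parity schemes, the following Python
sketch shows the general XOR idea behind RAID4/RAID5-style recovery.  It is
an illustration only, not the library's code or on-disk layout: the function
names and the tiny 4-byte blocks are assumptions.  Parity is computed once
(as it would be at close time), and a single lost block is rebuilt from the
surviving blocks plus the parity.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def compute_parity(data_blocks):
    """Parity block for one stripe, computed once (e.g. when the file is closed)."""
    return xor_blocks(data_blocks)

def rebuild_lost_block(surviving_blocks, parity):
    """Recover the single missing block from the survivors plus the parity."""
    return xor_blocks(surviving_blocks + [parity])

# One stripe spread over three hypothetical IO nodes, 4 bytes per block.
stripe = [b"PVFS", b"data", b"blok"]
parity = compute_parity(stripe)

# Pretend the node holding block 1 is lost.
survivors = [stripe[0], stripe[2]]
recovered = rebuild_lost_block(survivors, parity)
print(recovered)  # prints b'data'
```

Note that this only recovers one lost block per stripe, and only if the
parity was written after the last change to the data, which is why files
open for writing at failure time are not covered.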

Once again, this isn't a general-purpose solution.  There's no fail-over.
But it does make it a lot less likely to lose data.  I expect this to
become available in several weeks - if you are interested let me know and
I'll keep you in mind for alpha and beta testing.

Finally, the current development on PVFS version 2 will include mechanisms
for file redundancy (probably mirroring) supported by the servers (with
fail-over and stuff) and certainly anyone interested in contributing ideas
or man-power to THAT effort should contact us.


> I see that PVFS stripes data, but it seems there is no fault 
> tolerance.  That is, a node goes down, and your data is not available.  So, 
> although IDE drives are pretty reliable, the chance of data loss from a 
> PVFS system is essentially the chance of a single node (disk) going down 
> times the number of nodes in the PVFS file space.  Not so good.  So, yeah, 
> I buy a large capacity tape drive.  But even then, do I have to look at 
> restoring the *entire* PVFS volume, which could be very large (say 20 nodes 
> times 30 GB) just for losing one node?  Odds are, losing a disk is "when", 
> not "if".
> -Ben Ransom
>   UC Davis

Dr. Walter B. Ligon III
Associate Professor
ECE Department
Clemson University
