[Beowulf] Big storage

Fri Sep 14 09:01:30 PDT 2007

According to Bruce Allen:
[...]
> >>
> > We are not using fsprobe on our X4500.
> >
> > There are two reasons:
>
> <SNIP>
>
> I still think that the results would be interesting.
>
> In response to the reasons you gave:
>
> [1] I agree that if ZFS + hardware works as it is supposed to, there will
> not be any corruption.  But it would be nice to prove this via experiment.
>
I agree that it would be nice to be able to prove ZFS effectiveness
experimentally.

In my opinion, the only valid experiment would be to explicitly corrupt
data on a disk taken out of a zpool, then read data from that disk
after putting it back into the server.

The fact that fsprobe does not detect corruption does not imply there is
no corruption.
The problem is that fsprobe doesn't really test the whole device, only
some of the parts that do not contain data at the time of the test
(since it creates new files in existing filesystems).

Since some of our X4500s hold mostly read-only content (like many of our
Xrootd servers), fsprobe would not have the opportunity to test the
parts of the disks that contain data (which is precisely where we want
to detect silent data corruption).

Don't get me wrong, I agree that fsprobe is useful and that some
"spontaneous" data corruption happens in the wild.
We have been hit a few times by data corruption due to disk firmware,
RAID controllers, filesystem bugs, non-fatal failures or, sometimes,
unknown causes.
I even developed a simple program similar to fsprobe a few years ago in
order to detect active data corruption on disk servers based on 3ware
RAID controllers.

But the assumptions that make fsprobe useful (about the system buffer
cache, the life expectancy of "active" data blocks and so on) mostly do
not apply, or are too expensive to satisfy, on the X4500s in my
environment.

Even if ZFS error detection (and "zpool scrub") were proved not to
work, most of the programs and data formats we use include and verify
data checksums/hash values so, *in our environment*, they would be as
likely to find a problem as fsprobe is (probably more likely, since they
access a larger amount of data) and at a lower cost.
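
To illustrate the kind of application-level check meant here, a minimal
sketch of verifying a file against a previously recorded hash follows.
It does not reproduce any specific program or data format we use; the
function names and the choice of SHA-256 are assumptions made for the
example.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file content still matches the recorded digest.

    The digest is assumed to have been recorded when the file was
    written (hypothetical workflow, for illustration only).
    """
    return sha256_of(path) == expected_hex
```

A check like this runs every time the data are actually read, so it
covers exactly the blocks that hold real data.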

Therefore, I consider running fsprobe on our X4500s a suboptimal use
of these resources.

Like most (or all, I guess) LCG Tier-1s and sites of similar or larger
size, we have something that does essentially the same thing as
fsprobe (write/read/compare/report differences), except that it is only
used for acceptance/burn-in testing of new hardware.
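
The write/read/compare/report cycle can be sketched roughly as below.
This is neither fsprobe nor our in-house tool, only an illustration of
the principle; the function names, the block-sized pattern and the list
of offsets returned are assumptions.

```python
import os


def write_pattern(path, pattern, count):
    """Write `count` copies of `pattern` to `path` and flush to disk."""
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(pattern)
        f.flush()
        os.fsync(f.fileno())


def find_mismatches(path, pattern):
    """Re-read the file block by block; return offsets of bad blocks.

    Note: the blocks may be served from the buffer cache rather than
    from the disk, so a clean result does not prove the on-disk copy
    is intact.
    """
    block = len(pattern)
    bad = []
    offset = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(block)
            if not data:
                break
            if data != pattern[:len(data)]:
                bad.append(offset)
            offset += len(data)
    return bad
```

Typical use would be to call write_pattern() once with a random block
(for instance os.urandom(4096)) and then call find_mismatches()
periodically, reporting any offsets it returns.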

>
> [2] You can probably force writes to disk by simply writing files too
> large to fit into the memory cache.  Or modify fsprobe (or ask Peter to
> modify it) so that it fsync()s after writes rather than using the direct
> IO to bypass the device block buffer layer.
>
fsprobe already has an option for using "fsync(2)" and "fdatasync(2)"
when writing files ('--Sync').
But that's just to make sure the data are actually sent to disk, not
for making sure they won't be read from the cache.
The fsprobe source includes a comment just about that and the
assumptions they make ('checkFile' function around line 426).
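
The distinction can be sketched as follows. This is not fsprobe code;
the helper names are hypothetical. Syncing after the write pushes the
data to the device, but says nothing about where a later read is
served from.

```python
import os


def write_durably(path, data, data_only=False):
    """Write `data` to `path`, then sync it to the device.

    With data_only=True, fdatasync(2) is used when the platform
    provides it (file data only, not all metadata); otherwise
    fsync(2) is used.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        if data_only and hasattr(os, "fdatasync"):
            os.fdatasync(fd)
        else:
            os.fsync(fd)
    finally:
        os.close(fd)


def read_back(path):
    """Read the file again.

    The sync above does not evict anything from the cache, so this
    read may well be satisfied from memory and never touch the disk.
    """
    with open(path, "rb") as f:
        return f.read()
```

So a comparison of read_back() against the written data succeeding
only shows that the cached copy is intact.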

Writing a very large file to evict data from the cache would have a
significant impact on the other processes on the machine, which defeats
one of the aims of fsprobe: to be run on live machines running
"production" services.

>
> In any case by the end of the year I should have at least ten X4500s, and
> can do some testing myself.  But your collection is an order of magnitude
> larger, so you can collect much more useful statistics.  If those
> statistics show no data corruption, then someone like myself with many
> fewer systems can be very confident that no silent corruption is occurring.
>
fsprobe is a Linux-only program; it doesn't compile out of the box on
Solaris (even 10+).
I have a simple patch to make it compile and run on Solaris though, if
someone is interested.

Loïc.
-- 
| Loïc Tortay <tortay at cc.in2p3.fr> -     IN2P3 Computing Centre     |