[Beowulf] Big storage
Loic Tortay tortay at cc.in2p3.fr
Fri Sep 14 00:32:54 PDT 2007
According to Leif Nixon:
[...]
> >
> > There are two reasons:
> >  . ZFS has built-in error detection (through "zpool scrub") and we are
> >    (maybe naively) relying on this to detect and correct data corruption
> >    which would be otherwise silent;
>
> It *would* be interesting to see if the ZFS checksumming lives up to
> its promises.
>
During the last HEPiX meeting, Peter Kelemen mentioned something told to
him by a ZFS developer (Jeff Bonwick, if I'm not mistaken) about data
corrupted by a Fibre Channel HBA during transfer between disk and host.
ZFS, reportedly, detected (and corrected) the corruption.
Of course a ZFS developer may be biased.

I'm probably mis-remembering some of the technical details about this,
since they seem quite unlikely now (something about the laser beam being
somehow "corrupted", but I think this would be detected by the Fibre
Channel link protocols or upper-layer checksums).
The technical explanation was probably more akin to data corruption
during DMA transfer from the HBA to the host memory.

If you remember some of the figures Peter gave, most of the corruptions
they found were not random/spontaneous.
A very large majority was due to buggy hard disk firmware and another
significant part to a batch of defective memory.

His slides are available here:
<https://indico.desy.de/contributionDisplay.py?contribId=65&sessionId=42&confId=257>.

[...]
>
> I still think it would be interesting to see how often one gets data
> corruption from other sources than disk errors (presuming ZFS is
> perfect). Data corruption is data corruption even if it's from bad
> cache memory.
>
Indeed, data corruption is data corruption wherever it comes from.

But since fsprobe writes its own data to disk, it can't test for
corruptions on data (and metadata) which is already stored, leaving
whole parts of the disks untested on machines where files are static
(system disks, program binaries, archives, etc.)
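
To illustrate the distinction drawn above, here is a minimal sketch in
Python. It is not fsprobe's actual algorithm and not how ZFS works
internally; the function names (probe_write_verify, record_checksum,
verify_stored) are hypothetical and chosen only for this example. The
first function checks only the data it writes itself, while the other two
show that checking already-stored files needs a checksum recorded at the
time the data was known to be good.

```python
import hashlib
import os
import tempfile


def probe_write_verify(directory, size=1 << 20, pattern=b"\xa5"):
    """Write a known pattern to a scratch file, read it back and compare.

    Only the blocks used by the scratch file are exercised; files that
    already exist in the directory are never examined.
    """
    expected = pattern * size
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(expected)
            f.flush()
            os.fsync(f.fileno())  # push the data towards the device
        # Note: this read may be served from the page cache, not the disk.
        with open(path, "rb") as f:
            actual = f.read()
        return actual == expected
    finally:
        os.remove(path)


def record_checksum(path):
    """Return a SHA-256 hex digest of a file, computed in 64 KiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_stored(path, recorded):
    """Re-read a stored file and compare against a previously recorded digest."""
    return record_checksum(path) == recorded
```

A probe of the first kind says nothing about a static binary or archive,
whereas the second kind can flag it, provided the digest was recorded
before the corruption happened.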

On the other hand, ZFS has a "parity check on read" feature which should
be able to detect these corruptions.

If the data is corrupted during the transfer from disk to memory or in
memory before it's moved to userland it will (supposedly) be spotted by
ZFS.

If the data is corrupted in memory on a machine on which we use ZFS,
then the machine is badly failing since they're all supposed to have ECC
memory.


Loïc.
-- 
| Loïc Tortay <tortay at cc.in2p3.fr> - IN2P3 Computing Centre |
