[Beowulf] strange problem with large file moving between server

Andrew Holway andrew.holway at gmail.com
Sun Sep 21 10:59:00 PDT 2014


>
> Regarding ZFS: is that available for Linux now? I have lost track a bit here.
>

Yes.

http://zfsonlinux.org/

I would say it's ready for production now. Intel are about to start
supporting it under Lustre in the next couple of months and they are
typically careful about such things.

Cheers,

Andrew



>
> All the best from London
>
> Jörg
>
> On Sunday 21 September 2014 you wrote:
> > Hi Jörg,
> >
> > Sounds like a "typical" but very uncommon silent data corruption problem.
> > If you have another copy of the data, compare against that. If you don't
> > have another copy, accept that some of your data may have been silently
> > corrupted.
> >
> > Most RAID controllers do periodic "scrubbing"; was your Infortrend doing
> > that?
> >
> > For the new system, consider using ZFS pointed at plain disks, as it may
> > have more layers of checksums compared to your current system.
> >
> > Regards,
> > Alex
> >
> > On Sunday, September 21, 2014, Jörg Saßmannshausen <j.sassmannshausen at ucl.ac.uk> wrote:
> > > Dear all,
> > >
> > > > I have a rather strange problem with one of my file servers, which I
> > > > recently upgraded in order to accommodate more disc space.
> > >
> > > > The problem: I copied the files from the old file space to a temporary
> > > > disc storage space using this rsync command:
> > > >
> > > > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo tempspace:baa
> > >
> > > > I have been doing this for some years now and never had any problems.
> > >
> > > > As always, I ran md5sum afterwards to make sure there is no problem
> > > > later and the user does not lose data. This time, for one rather large
> > > > file (around 16 GB), the md5sum check failed after I moved the files
> > > > from the temp space back to the new destination using the same command
> > > > as above.
> > >
> > > > Still having access to the old file space, I decided to copy this file
> > > > again from the old file space. Strangely enough, rsync did not sync the
> > > > file again, so I had to delete it. Even after deleting the file and
> > > > re-syncing it from the old source, the md5sum is wrong.
> > >
> > > > Copying the file to a different file space did not cause this problem,
> > > > i.e. the md5sum is correct.
> > > > As it is a tar.gz file, I simply decided to decompress the original file
> > > > on the different file server. That worked. The copy with the wrong
> > > > md5sum did not decompress on the different file server; gunzip aborted
> > > > with an error message. So the file is broken.
> > >
> > > The setup:
> > >
> > > > Originally I was using an old Infortrend box which had old PATA discs in
> > > > it. This box is connected via SCSI to a frontend server which exports
> > > > the file space via iSCSI. The backend for that, i.e. the one the user is
> > > > accessing, is on a different physical machine and is a Xen guest. The
> > > > reason behind this setup is that the frontend acts as a backup server
> > > > and I don't want people to have access to it.
> > > > I then exchanged the Infortrend box for a more recent model which has
> > > > SATA capabilities but still has a SCSI connection to the frontend. The
> > > > frontend is the same. I got a new controller for that box as the old one
> > > > was broken. There are no changes in the backend; that is still the same
> > > > Xen guest on the same hardware.
> > >
> > > > What I cannot work out is why the old Infortrend box does not have any
> > > > problems with this file while the newer one does. Also, when I copied
> > > > over some files (again using the rsync command above), a few files did
> > > > not copy correctly (again according to md5sum) in the first instance but
> > > > did so later.
> > >
> > > > I find that highly alarming, as it means that at least for larger and/or
> > > > some binary files there seems to be a problem. However, I am not sure
> > > > where to look, as I am out of ideas.
> > >
> > > > Could there be a problem with the 'new' controller?
> > > > In all cases I was using ext4 as the file system and I did not have any
> > > > problems with that.
> > >
> > > > Does anybody have any thoughts on this?
> > >
> > > All the best from a sunny London
> > >
> > > Jörg
> > >
> > > > P.S. To make things worse, I am off on a work-related trip from Monday
> > > > onwards and I have been working on this problem since Friday evening.
> > >
> > >
> > >
> > > --
> > > *************************************************************
> > > Dr. Jörg Saßmannshausen, MRSC
> > > University College London
> > > Department of Chemistry
> > > Gordon Street
> > > London
> > > WC1H 0AJ
> > >
> > > > email: j.sassmannshausen at ucl.ac.uk
> > > web: http://sassy.formativ.net
> > >
> > > Please avoid sending me Word or PowerPoint attachments.
> > > See http://www.gnu.org/philosophy/no-word-attachments.html