[Beowulf] strange problem with large file moving between server

Jörg Saßmannshausen j.sassmannshausen at ucl.ac.uk
Sun Sep 21 11:08:34 PDT 2014


Hi Andrew,

thanks.

I will look into that. It is good to hear it is ready for production now. The
last time I looked into it, it was not.

All the best

Jörg

On Sunday, 21 September 2014 you wrote:
> > Regarding ZFS: is that available for Linux now? I have lost track a bit here.
> 
> Yes.
> 
> http://zfsonlinux.org/
> 
> I would say it's ready for production now. Intel are about to start
> supporting it under Lustre in the next couple of months and they are
> typically careful about such things.
> 
> Cheers,
> 
> Andrew
> 
> > All the best from London
> > 
> > Jörg
> > 
> > On Sunday, 21 September 2014 you wrote:
> > > Hi Jörg,
> > > 
> > > Sounds like a "typical" but very uncommon silent data corruption
> > > problem. If you have another copy of the data, compare against that.
> > > If you don't have another copy, accept that some of your data may
> > > have been silently corrupted.
> > > 
> > > Most RAID controllers do periodic "scrubbing"; was your Infortrend
> > > doing that?
> > > 
> > > For the new system, consider using ZFS pointed at plain disks, as it
> > > may have more layers of checksums compared to your current system.
> > > 
> > > Regards,
> > > Alex
> > > 
> > > On Sunday, September 21, 2014, Jörg Saßmannshausen <j.sassmannshausen at ucl.ac.uk> wrote:
> > > > Dear all,
> > > > 
> > > > I have a rather strange problem with one of my file servers, which I
> > > > recently upgraded in order to accommodate more disc space.
> > > > 
> > > > The problem: I copied the files from the old file space to a
> > > > temporary disc storage space using this rsync command:
> > > > 
> > > > rsync -vrltH -pgo --stats -D --numeric-ids -x oldserver:foo tempspace:baa
> > > > 
> > > > I have been doing this for some years now and never had any
> > > > problems.
> > > > 
> > > > As always, I ran md5sum afterwards to make sure there is no problem
> > > > later and the user is not losing data. This time, for a rather large
> > > > file (around 16 GB), the md5sum failed after I moved the files from
> > > > the temp space back to the new destination using the same command as
> > > > above.
> > > > 
> > > > Still having access to the old file space, I decided to move this
> > > > file from the old file space again. Strangely enough, rsync did not
> > > > sync the file again, so I had to delete the file. Even after deleting
> > > > the file and re-syncing it from the old source, the md5sum is wrong.
> > > > 
> > > > Copying the file to a different file space did not cause this
> > > > problem, i.e. the md5sum is correct.
> > > > As it is a tar.gz file, I simply decided to decompress the original
> > > > file on the different file server. That worked. The file where the
> > > > md5sum is wrong did not decompress on the different file server but
> > > > failed with an error message when I executed gunzip. So the file is
> > > > broken.
> > > > 
> > > > The setup:
> > > > 
> > > > Originally I was using an old Infortrend box which had old PATA discs
> > > > in it. This box is connected via SCSI to a frontend server which
> > > > exports the file space via iSCSI. The backend for that, i.e. the one
> > > > the user is accessing, is on a different physical machine and is a
> > > > XEN guest. The reason behind that setup is that the frontend is
> > > > acting as a backup server and I don't want people to have access to
> > > > it.
> > > > I then exchanged the Infortrend box for a more recent model which
> > > > has SATA capabilities but still has a SCSI connection to the
> > > > frontend. The frontend is the same. I got a new controller for that
> > > > box as the old one was broken. There are no changes in the backend;
> > > > that is still the same XEN guest on the same hardware.
> > > > 
> > > > What I cannot work out is why the old Infortrend box does not have
> > > > any problems with the new file, while the newer one has a problem
> > > > here. Also, when I copied over some files (again using the rsync
> > > > command above), a few files did not copy correctly (again according
> > > > to md5sum) in the first instance but did so later.
> > > > 
> > > > I find that highly alarming, as it means that at least for larger
> > > > and/or some binary files there seems to be a problem. However, I am
> > > > not sure where to look as I am out of ideas.
> > > > 
> > > > Could it be there is a problem with the 'new' controller?
> > > > In all cases I was using ext4 as a file system and I did not have any
> > > > problems with that.
> > > > 
> > > > Does anybody have any thoughts on this?
> > > > 
> > > > All the best from a sunny London
> > > > 
> > > > Jörg
> > > > 
> > > > P.S. To make things worse, I am off on a work-related trip from
> > > > Monday onwards, and I have been working on this problem since Friday
> > > > evening.
> > 
> > _______________________________________________
> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> > To change your subscription (digest mode or unsubscribe) visit
> > http://www.beowulf.org/mailman/listinfo/beowulf
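
A minimal sketch of the two checks discussed in the thread above: verifying
a copy by its MD5 digest, and comparing a suspect copy against a known-good
one to see where they diverge. It assumes Python 3 is available on the
machine holding both copies. The names md5_of, first_difference and CHUNK
are illustrative only and do not come from the thread. Note also that rsync
by default skips a file whose size and modification time already match on
both sides, which may be why the broken file was not re-synced; its
-c/--checksum option makes it compare file contents instead.

```python
import hashlib
from pathlib import Path
from typing import Optional

# Assumption: 1 MiB reads, so a 16 GB file is never held in memory at once.
CHUNK = 1 << 20


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def first_difference(a: Path, b: Path) -> Optional[int]:
    """Return the byte offset of the first mismatch, or None if identical."""
    offset = 0
    with a.open("rb") as fa, b.open("rb") as fb:
        while True:
            block_a, block_b = fa.read(CHUNK), fb.read(CHUNK)
            if block_a != block_b:
                # Scan the overlapping part for the exact mismatching byte.
                for i, (x, y) in enumerate(zip(block_a, block_b)):
                    if x != y:
                        return offset + i
                # No byte differs, so one file simply ended earlier.
                return offset + min(len(block_a), len(block_b))
            if not block_a:
                # Both files reached end-of-file together: identical.
                return None
            offset += len(block_a)
```

Calling first_difference on the good and the broken copy of the tar.gz file
would give the offset where the corruption starts, which might help to tell
a single flipped block from a copy that goes wrong from some point onwards.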


-- 
*************************************************************
Dr. Jörg Saßmannshausen, MRSC
University College London
Department of Chemistry
Gordon Street
London
WC1H 0AJ 

email: j.sassmannshausen at ucl.ac.uk
web: http://sassy.formativ.net

Please avoid sending me Word or PowerPoint attachments.
See http://www.gnu.org/philosophy/no-word-attachments.html