[Beowulf] serious NFS problem on Mandrake 10.0
David Mathog mathog at mendel.bio.caltech.edu
Fri Dec 3 15:23:02 PST 2004
> > This is very, very, VERY bad.
>
> indeed. is it safe to assume your machines are quite stable
> (memtest86-wise)?

Yes. They run days and days without any errors. (S2466 motherboards
with single Athlon MP 2200+ processors, ECC enabled.)

> the fact that it's 32K is interesting, since
> I suspect your NFS block size is that (see /proc/mounts to verify).

Yes, that is the size for /u1 in /proc/mounts.

I wrote a little script to beat on the NFS system with copies from
remote nodes to the master; it's attached after my signature. It did
the mv-over-NFS operation 100 times on each of 18 nodes, running the
md5sum on each file once it arrived. All nodes ran simultaneously. Of
these 1800 network copies, 142 had an md5sum that didn't match. Note
that I couldn't get an error in a couple of tries running just two
nodes at a time, so total load seems to matter, presumably on the
master node, but maybe on the switch?

Every node logged these sorts of errors, and the variation looks like
random scatter. Given the low rate (relatively speaking) it is probably
one event per file, and since the files are around 1.3 MB each, or
about 40 blocks of 32k per mv, the error rate per 32k block comes out
to 142/(1800*40) = 0.00197.

> does your
> server have DIRECT_IO enabled on NFS?

CONFIG_NFS_DIRECTIO=y

> what kind of block device is it
> writing to,

It's an IBM SCSI disk going through the Adaptec controller on the Tyan
S2468UGN motherboard. Shows up as /dev/sde. Not sure if that answers
the question.

> and what filesystem for that matter?

ext2

> or have you already
> tried a different filesystem?

No spare disk to build another filesystem on. It doesn't seem likely to
be the filesystem, since if it were giving those error rates in normal
writes that disk would be swiss cheese by now.

One last thing, one of these events was registered:

Dec 3 14:52:10 safserver ifplugd(eth1)[1649]: Link beat lost.
Dec 3 14:52:11 safserver ifplugd(eth1)[1649]: Link beat detected.

but it wouldn't explain all the errors because they were scattered
through the run time, and that took a lot longer than 1 second to
complete. Network is 100baseT through a DLINK DSS-24 switch.

Regards,

David Mathog
mathog at caltech.edu
Manager, Sequence Analysis Facility, Biology Division, Caltech

cat testmv.sh   #this may wrap!!!!

#!/bin/sh
cd ~safrun
NODE=`hostname`
count=100
# reference checksum of the local source file
set `md5sum /tmp/SAVELASTMEGABLAST.txt`
HOLDMD=$1
echo "initial md5sum is $HOLDMD" > /tmp/ERRORS.$NODE
while [ $count -gt 1 ]
do
  count=`expr $count - 1`
  cp /tmp/SAVELASTMEGABLAST.txt /tmp/TESTLAST.txt
  # move from local /tmp into the NFS-mounted directory
  mv /tmp/TESTLAST.txt ./TESTLAST.txt.$NODE
  set `md5sum TESTLAST.txt.$NODE`
  NEWMD=$1
  /bin/rm ./TESTLAST.txt.$NODE
  if [ "$NEWMD" != "$HOLDMD" ]
  then
    echo "error: md5sum is $NEWMD at $count" >>/tmp/ERRORS.$NODE
  fi
done
echo "error: done" >>/tmp/ERRORS.$NODE
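
The "one event per file" guess and the per-block rate above could be checked more directly if a corrupted copy were kept instead of removed. The following is only a sketch under that assumption: bad_blocks and block_error_rate are hypothetical helper names, not part of testmv.sh, and the 32768-byte default simply mirrors the block size reported in /proc/mounts. The first function lists which blocks differ between a good file and a bad copy; the second repeats the arithmetic for the rate.

```sh
#!/bin/sh
# Hypothetical helpers, not part of the original testmv.sh.

# bad_blocks GOOD BAD [BLOCKSIZE]
# Print the 0-based index of every BLOCKSIZE-byte block in which the two
# files differ. cmp -l lists each differing byte as "offset oct oct",
# with the offset 1-based. Block size defaults to 32768 (assumed).
bad_blocks() {
    { cmp -l "$1" "$2" 2>/dev/null || true; } |
        awk -v bs="${3:-32768}" '{
            b = int(($1 - 1) / bs)
            if (!(b in seen)) { seen[b] = 1; print b }
        }'
}

# block_error_rate FAILURES COPIES BLOCKS_PER_COPY
# Failures divided by the total number of blocks transferred.
block_error_rate() {
    awk -v f="$1" -v c="$2" -v b="$3" \
        'BEGIN { printf "%.5f\n", f / (c * b) }'
}
```

With the figures from the run above, block_error_rate 142 1800 40 prints 0.00197. If bad_blocks reports a single block index for each corrupted copy, that would be consistent with one damaged 32k transfer per file; several scattered indices would point elsewhere.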
