[Beowulf] NFS over TCP or something else... what have I done wrong?

Yaroslav Halchenko list-beowulf at onerussian.com
Thu Feb 3 19:20:05 PST 2005


Dear Beowulfers,

Today is a sad day for our 25-node cluster: I decided to improve its
performance, and as a result I crippled it quite badly.

The story is that for some reason many nodes started losing their
connection to the NFS server node. While looking for a solution I
decided to try NFS over TCP. After I had adjusted the configs across
the cluster (cfengine rulez) and even rebooted the nodes (besides the
main one) for good measure, I put a slight load on the cluster:
6 nodes doing intensive I/O, reading and writing data on the NFS
server. Pretty much all of the 60 nfsd instances started occupying CPU
on the main node, the load reached around 20 or 30 (an astronomically
high number), the main node (the NFS server) became unresponsive, and
it started killing applications for "running out of memory".
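
In case it helps with the diagnosis, this is roughly how I have been
watching the nfsd threads on the server (a sketch for a 2.6 kernel;
the "th" line in the nfsd stats shows the thread count and how often
all threads were busy at once):

  # server side: thread count and "all threads busy" histogram
  grep ^th /proc/net/rpc/nfsd
  # server side: RPC/NFS call counters
  nfsstat -s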

So what is wrong with the following config:
vana:/raid        /raid   nfs defaults,tcp,hard,rw,nosuid,wsize=8192,rsize=8192
?
Later I adjusted it with bg,timeo=60,noatime to reduce the load, but
that didn't quite help.
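
For the record, this is how I have been double-checking which options
the clients actually ended up with (the fstab options and the
negotiated options can differ):

  # on a client: show the options the kernel actually negotiated
  grep ' nfs ' /proc/mounts
  # or, with nfs-utils installed:
  nfsstat -m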

Details about the cluster: 23 active nodes at the moment, running
2.6.8.1 SMP kernels; the main node has 8 GB of RAM, RPCNFSDCOUNT=70,
and runs nfs-kernel-server.
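
In case the 70 threads are themselves part of the problem, here is
where that number is set (assuming the Debian packaging, which is
where nfs-kernel-server and RPCNFSDCOUNT come from):

  # /etc/default/nfs-kernel-server
  RPCNFSDCOUNT=16   # e.g. dropping back from 70 to something modest
  # then restart the server:
  /etc/init.d/nfs-kernel-server restart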

What would be the best NFS config for it, given that we export two
directories from the NFS server (a sketch follows below):

/raid as rw,sync
and
/share/apps as ro,async
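
i.e. something along these lines on the server, with matching client
fstab entries (just a sketch of the shape; the subnet and the
no_subtree_check/intr options are placeholders, not what we run now):

  # /etc/exports on vana
  /raid        10.0.0.0/24(rw,sync,no_subtree_check)
  /share/apps  10.0.0.0/24(ro,async,no_subtree_check)

  # client /etc/fstab
  vana:/raid        /raid        nfs  tcp,hard,intr,rw,nosuid,rsize=8192,wsize=8192  0 0
  vana:/share/apps  /share/apps  nfs  tcp,hard,intr,ro,nosuid,rsize=8192,wsize=8192  0 0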

Thank you in advance

P.S. BTW, here is the dump from the "killing mess":

Fixed up OOM kill of mm-less task
oom-killer: gfp_mask=0xd0
DMA per-cpu:
cpu 0 hot: low 2, high 6, batch 1
cpu 0 cold: low 0, high 2, batch 1
cpu 1 hot: low 2, high 6, batch 1
cpu 1 cold: low 0, high 2, batch 1
Normal per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16
HighMem per-cpu:
cpu 0 hot: low 32, high 96, batch 16
cpu 0 cold: low 0, high 32, batch 16
cpu 1 hot: low 32, high 96, batch 16
cpu 1 cold: low 0, high 32, batch 16

Free pages:     2969440kB (2966528kB HighMem)
Active:506964 inactive:611412 dirty:461 writeback:0 unstable:0 free:742360 slab:193835 mapped:269296 pagetables:2983
DMA free:1048kB min:16kB low:32kB high:48kB active:0kB inactive:0kB present:16384kB
protections[]: 8 476 732
Normal free:1864kB min:936kB low:1872kB high:2808kB active:32632kB inactive:21288kB present:901120kB
protections[]: 0 468 724
HighMem free:2966528kB min:512kB low:1024kB high:1536kB active:1995096kB inactive:2424488kB present:7471104kB
protections[]: 0 0 256
DMA: 0*4kB 15*8kB 10*16kB 8*32kB 2*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1048kB
Normal: 14*4kB 2*8kB 0*16kB 38*32kB 1*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1864kB
HighMem: 0*4kB 0*8kB 0*16kB 48126*32kB 16915*64kB 2081*128kB 109*256kB 57*512kB 20*1024kB 0*2048kB 0*4096kB = 2966528kB
Swap cache: add 538373, delete 522525, find 54148646/54172304, race 0+5
Out of Memory: Killed process 17465 (gnome-settings-).


-- 
                                  .-.
=------------------------------   /v\  ----------------------------=
Keep in touch                    // \\     (yoh@|www.)onerussian.com
Yaroslav Halchenko              /(   )\               ICQ#: 60653192
                   Linux User    ^^-^^    [175555]
             Key  http://www.onerussian.com/gpg-yoh.asc
GPG fingerprint   3BB6 E124 0643 A615 6F00  6854 8D11 4563 75C0 24C8


