[Beowulf] NFS+XFS+SMP on kernel 2.6

Paul Nowoczynski pauln at psc.edu
Wed Jun 15 12:18:12 PDT 2005


This is different from what was crashing my io nodes.
If you think that low memory has something to do with the
crash, then try setting "/proc/sys/vm/min_free_kbytes" to 65536
and "/proc/sys/vm/lower_zone_protection" to 100 or 200.
These settings will prevent the pagecache from hogging all your
memory.
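
If you want those values to survive a reboot, the usual route is an
/etc/sysctl.conf fragment (a sketch; these are the 2.6.8-era sysctl
names -- later kernels dropped lower_zone_protection in favor of
lowmem_reserve_ratio, so check what your kernel actually exposes):

    # keep ~64 MB free for the kernel's emergency pool (value is in KiB)
    vm.min_free_kbytes = 65536
    # shield the low (DMA/Normal) zones from highmem-capable allocations
    vm.lower_zone_protection = 100

Load it with "sysctl -p", or echo the values into the /proc files above
to apply them immediately.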
paul

Suvendra Nath Dutta wrote:

> /var/log/messages
>
>
> Jun 14 16:39:48 sauron kernel: ----------- [cut here ] --------- [please bite here ] ---------
> Jun 14 16:39:48 sauron kernel: Kernel BUG at debug:106
> Jun 14 16:39:48 sauron kernel: invalid operand: 0000 [1] SMP
> Jun 14 16:39:48 sauron kernel: CPU 1
> Jun 14 16:39:48 sauron kernel: Modules linked in: e1000 tg3 subfs dm_mod
> Jun 14 16:39:48 sauron kernel: Pid: 10070, comm: nfsd Not tainted 2.6.8.1-suse91-osmp
> Jun 14 16:39:48 sauron kernel: RIP: 0010:[cmn_err+278/299] <ffffffff802c9456>{cmn_err+278}
> Jun 14 16:39:48 sauron kernel: RIP: 0010:[<ffffffff802c9456>] <ffffffff802c9456>{cmn_err+278}
> Jun 14 16:39:48 sauron kernel: RSP: 0018:00000100791d17b8  EFLAGS: 00010246
> Jun 14 16:39:48 sauron kernel: RAX: 0000000000000050 RBX: 0000000000000000 RCX: ffffffff805b4ae8
> Jun 14 16:39:48 sauron kernel: RDX: ffffffff805b4ae8 RSI: 0000000000000001 RDI: 000001006e6aab30
> Jun 14 16:39:48 sauron kernel: RBP: 0000010033f47ac0 R08: 0000000000000001 R09: 0000000000000001
> Jun 14 16:39:50 sauron kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000010033f47af0
> Jun 14 16:39:50 sauron kernel: R13: 0000000098ee8d60 R14: 000001007e169000 R15: 000001007cf53a38
> Jun 14 16:39:50 sauron kernel: FS:  0000002a9588d6e0(0000) GS:ffffffff806f5040(0000) knlGS:0000000062693bb0
> Jun 14 16:39:50 sauron kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> Jun 14 16:39:51 sauron kernel: CR2: 0000002a9558c000 CR3: 0000000037eca000 CR4: 00000000000006e0
> Jun 14 16:39:51 sauron kernel: Process nfsd (pid: 10070, threadinfo 00000100791d0000, task 000001006e6aab30)
> Jun 14 16:39:51 sauron kernel: Stack: 0000000000000001 0000000000000293 0000003000000020 00000100791d18a8
> Jun 14 16:39:51 sauron kernel:        00000100791d17e8 ffffffff80153b08 0000000000001000 ffffffff8017677a
> Jun 14 16:39:51 sauron kernel:        0000010078a8d080 0000010033f47ac0
> Jun 14 16:39:51 sauron kernel: Call Trace:<ffffffff80153b08>{find_get_page+24} <ffffffff8017677a>{__find_get_block_slow+74}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c8ef8>{vn_purge+328} <ffffffff80177e98>{unmap_underlying_metadata+8}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c7c99>{linvfs_alloc_inode+41} <ffffffff8018e6a6>{iget_locked+230}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c91ec>{vn_initialize+124} <ffffffff802a02b6>{xfs_iget+358}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c8fe4>{vn_remove+68} <ffffffff802b6b73>{xfs_vget+51}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c87d8>{vfs_vget+40} <ffffffff802a9e41>{xlog_write+1057}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c77eb>{linvfs_get_dentry+59} <ffffffff802186f0>{find_exported_dentry+64}
> Jun 14 16:39:51 sauron kernel:        <ffffffff8021bdf0>{nfsd_acceptable+0} <ffffffff8047b011>{sock_alloc_send_pskb+113}
> Jun 14 16:39:51 sauron kernel:        <ffffffff80491b88>{rt_hash_code+56} <ffffffff80493c10>{__ip_route_output_key+48}
> Jun 14 16:39:51 sauron kernel:        <ffffffff804819fd>{netif_receive_skb+381} <ffffffffa0013327>{:tg3:tg3_enable_ints+23}
> Jun 14 16:39:51 sauron kernel:        <ffffffff8049a319>{ip_append_data+809} <ffffffff8048f783>{qdisc_restart+35}
> Jun 14 16:39:51 sauron kernel:        <ffffffff8022084e>{exp_find_key+126} <ffffffff80218d7b>{export_decode_fh+123}
> Jun 14 16:39:51 sauron kernel:        <ffffffff8021bc31>{fh_verify+961} <ffffffff80135230>{autoremove_wake_function+0}
> Jun 14 16:39:51 sauron kernel:        <ffffffff80135230>{autoremove_wake_function+0} <ffffffff8021d6d8>{nfsd_open+56}
> Jun 14 16:39:51 sauron kernel:        <ffffffff8021da3b>{nfsd_write+107} <ffffffff8036e63f>{scsi_end_request+223}
> Jun 14 16:39:51 sauron kernel:        <ffffffff8036e84c>{scsi_io_completion+492} <ffffffff8015b99e>{cache_flusharray+110}
> Jun 14 16:39:51 sauron kernel:        <ffffffff80504bd2>{ip_map_lookup+306} <ffffffff805053a5>{svcauth_unix_accept+597}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802252d1>{nfsd3_proc_write+241} <ffffffff80218f60>{nfsd_dispatch+256}
> Jun 14 16:39:51 sauron kernel:        <ffffffff80501123>{svc_process+947} <ffffffff80219220>{nfsd+0}
> Jun 14 16:39:51 sauron kernel:        <ffffffff80219465>{nfsd+581} <ffffffff801332ee>{schedule_tail+14}
> Jun 14 16:39:51 sauron kernel:        <ffffffff801102a7>{child_rip+8} <ffffffff80219220>{nfsd+0}
> Jun 14 16:39:51 sauron kernel:        <ffffffff80219220>{nfsd+0} <ffffffff8011029f>{child_rip+0}
> Jun 14 16:39:51 sauron kernel:
> Jun 14 16:39:51 sauron kernel:
> Jun 14 16:39:51 sauron kernel: Code: 0f 0b cc 63 53 80 ff ff ff ff 6a 00 48 81 c4 e0 00 00 00 5b
> Jun 14 16:39:51 sauron kernel: RIP <ffffffff802c9456>{cmn_err+278} RSP <00000100791d17b8>
>
> On Jun 15, 2005, at 10:57 AM, Paul Nowoczynski wrote:
>
>> What kernel bug did you run into?  Was it a page_allocation failure?
>> paul
>>
>> Suvendra Nath Dutta wrote:
>>
>>> We set up a 160-node cluster with a dual-processor head node with 
>>> 2GB RAM. The head node also has two RAID devices attached to two 
>>> SCSI cards. These have an XFS filesystem on them and are NFS exported 
>>> to the cluster. The head node runs very low on memory (7-8 MB free). 
>>> And today I ran into a kernel bug that crashed the system. Google 
>>> suggests that I should upgrade to kernel 2.6.11, but that sounds 
>>> very unpleasant. I am thinking of moving the RAID boxes to a 
>>> separate machine. Will separating the file server and the head node 
>>> give me back stability on the head node?
>>>
>>> Suvendra.
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>



