[Beowulf] NFS+XFS+SMP on kernel 2.6


Paul Nowoczynski pauln at psc.edu
Wed Jun 15 12:18:12 PDT 2005


This is different from what was crashing my I/O nodes.
If you think that low memory has something to do with the
crash, then try setting "/proc/sys/vm/min_free_kbytes" to 65536
and "/proc/sys/vm/lower_zone_protection" to 100 or 200.
These settings should keep the pagebuffer from hogging all your
memory.
paul
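
For anyone who wants to script the two settings above, here is a minimal sketch in Python. The function name apply_vm_settings and the base_dir parameter are made up for illustration; only the two /proc/sys/vm file names and the values come from the message. It assumes it is run as root, and it assumes the running kernel actually exposes lower_zone_protection (not every 2.6 kernel does), so a missing file is reported rather than treated as fatal.

```python
import os

# Values taken from the message above. 100 is the lower of the two
# suggested lower_zone_protection values; 200 is the other option.
SUGGESTED = {
    "min_free_kbytes": "65536",
    "lower_zone_protection": "100",
}


def apply_vm_settings(settings=None, base_dir="/proc/sys/vm"):
    """Write each tunable under base_dir and return {name: status}.

    base_dir is a parameter (hypothetical, for illustration) so the
    function can be tried against a scratch directory first.
    """
    if settings is None:
        settings = SUGGESTED
    results = {}
    for name, value in settings.items():
        path = os.path.join(base_dir, name)
        if not os.path.exists(path):
            # This kernel does not expose the tunable; skip it.
            results[name] = "missing"
            continue
        try:
            with open(path, "w") as handle:
                handle.write(value + "\n")
            results[name] = "written"
        except OSError as exc:
            # Typically a permission error when not running as root.
            results[name] = "error: " + str(exc.strerror)
    return results
```

Calling apply_vm_settings() with no arguments targets the real /proc/sys/vm. Values written there do not survive a reboot, so they would have to be reapplied at boot if they turn out to help.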

Suvendra Nath Dutta wrote:

> /var/log/messages
>
>
> Jun 14 16:39:48 sauron kernel: ----------- [cut here ] --------- 
> [please bite here ] ---------
> Jun 14 16:39:48 sauron kernel: Kernel BUG at debug:106
> Jun 14 16:39:48 sauron kernel: invalid operand: 0000 [1] SMP
> Jun 14 16:39:48 sauron kernel: CPU 1
> Jun 14 16:39:48 sauron kernel: Modules linked in: e1000 tg3 subfs dm_mod
> Jun 14 16:39:48 sauron kernel: Pid: 10070, comm: nfsd Not tainted 
> 2.6.8.1-suse91-osmp
> Jun 14 16:39:48 sauron kernel: RIP: 0010:[cmn_err+278/299] 
> <ffffffff802c9456>{cmn_err+278}
> Jun 14 16:39:48 sauron kernel: RIP: 0010:[<ffffffff802c9456>] 
> <ffffffff802c9456>{cmn_err+278}
> Jun 14 16:39:48 sauron kernel: RSP: 0018:00000100791d17b8  EFLAGS: 
> 00010246
> Jun 14 16:39:48 sauron kernel: RAX: 0000000000000050 RBX: 
> 0000000000000000 RCX: ffffffff805b4ae8
> Jun 14 16:39:48 sauron kernel: RDX: ffffffff805b4ae8 RSI: 
> 0000000000000001 RDI: 000001006e6aab30
> Jun 14 16:39:48 sauron kernel: RBP: 0000010033f47ac0 R08: 
> 0000000000000001 R09: 0000000000000001
> Jun 14 16:39:50 sauron kernel: R10: 0000000000000000 R11: 
> 0000000000000000 R12: 0000010033f47af0
> Jun 14 16:39:50 sauron kernel: R13: 0000000098ee8d60 R14: 
> 000001007e169000 R15: 000001007cf53a38
> Jun 14 16:39:50 sauron kernel: FS:  0000002a9588d6e0(0000) 
> GS:ffffffff806f5040(0000) knlGS:0000000062693bb0
> Jun 14 16:39:50 sauron kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
> 000000008005003b
> Jun 14 16:39:51 sauron kernel: CR2: 0000002a9558c000 CR3: 
> 0000000037eca000 CR4: 00000000000006e0
> Jun 14 16:39:51 sauron kernel: Process nfsd (pid: 10070, threadinfo 
> 00000100791d0000, task 000001006e6aab30)
> Jun 14 16:39:51 sauron kernel: Stack: 0000000000000001 
> 0000000000000293 0000003000000020 00000100791d18a8
> Jun 14 16:39:51 sauron kernel:        00000100791d17e8 
> ffffffff80153b08 0000000000001000 ffffffff8017677a
> Jun 14 16:39:51 sauron kernel:        0000010078a8d080 0000010033f47ac0
> Jun 14 16:39:51 sauron kernel: Call 
> Trace:<ffffffff80153b08>{find_get_page+24} 
> <ffffffff8017677a>{__find_get_block_slow+74}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c8ef8>{vn_purge+328} 
> <ffffffff80177e98>{unmap_underlying_metadata+8}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff802c7c99>{linvfs_alloc_inode+41} 
> <ffffffff8018e6a6>{iget_locked+230}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff802c91ec>{vn_initialize+124} <ffffffff802a02b6>{xfs_iget+358}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c8fe4>{vn_remove+68} 
> <ffffffff802b6b73>{xfs_vget+51}
> Jun 14 16:39:51 sauron kernel:        <ffffffff802c87d8>{vfs_vget+40} 
> <ffffffff802a9e41>{xlog_write+1057}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff802c77eb>{linvfs_get_dentry+59} 
> <ffffffff802186f0>{find_exported_dentry+64}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff8021bdf0>{nfsd_acceptable+0} 
> <ffffffff8047b011>{sock_alloc_send_pskb+113}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff80491b88>{rt_hash_code+56} 
> <ffffffff80493c10>{__ip_route_output_key+48}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff804819fd>{netif_receive_skb+381} 
> <ffffffffa0013327>{:tg3:tg3_enable_ints+23}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff8049a319>{ip_append_data+809} 
> <ffffffff8048f783>{qdisc_restart+35}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff8022084e>{exp_find_key+126} 
> <ffffffff80218d7b>{export_decode_fh+123}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff8021bc31>{fh_verify+961} 
> <ffffffff80135230>{autoremove_wake_function+0}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff80135230>{autoremove_wake_function+0} 
> <ffffffff8021d6d8>{nfsd_open+56}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff8021da3b>{nfsd_write+107} 
> <ffffffff8036e63f>{scsi_end_request+223}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff8036e84c>{scsi_io_completion+492} 
> <ffffffff8015b99e>{cache_flusharray+110}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff80504bd2>{ip_map_lookup+306} 
> <ffffffff805053a5>{svcauth_unix_accept+597}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff802252d1>{nfsd3_proc_write+241} 
> <ffffffff80218f60>{nfsd_dispatch+256}
> Jun 14 16:39:51 sauron kernel:        
> <ffffffff80501123>{svc_process+947} <ffffffff80219220>{nfsd+0}
> Jun 14 16:39:51 sauron kernel:        <ffffffff80219465>{nfsd+581} 
> <ffffffff801332ee>{schedule_tail+14}
> Jun 14 16:39:51 sauron kernel:        <ffffffff801102a7>{child_rip+8} 
> <ffffffff80219220>{nfsd+0}
> Jun 14 16:39:51 sauron kernel:        <ffffffff80219220>{nfsd+0} 
> <ffffffff8011029f>{child_rip+0}
> Jun 14 16:39:51 sauron kernel:
> Jun 14 16:39:51 sauron kernel:
> Jun 14 16:39:51 sauron kernel: Code: 0f 0b cc 63 53 80 ff ff ff ff 6a 
> 00 48 81 c4 e0 00 00 00 5b
> Jun 14 16:39:51 sauron kernel: RIP <ffffffff802c9456>{cmn_err+278} RSP 
> <00000100791d17b8>
>
> On Jun 15, 2005, at 10:57 AM, Paul Nowoczynski wrote:
>
>> What kernel bug did you run into?  Was it a page_allocation failure?
>> paul
>>
>> Suvendra Nath Dutta wrote:
>>
>>> We set up a 160 node cluster with a dual processor head node with 
>>> 2GB RAM. The head node also has two RAID devices attached to two 
>>> SCSI cards. These have an XFS filesystem on them and are NFS exported 
>>> to the cluster. The head node runs very low on memory (7-8 MB). And 
>>> today I ran into a kernel bug that crashed the system. Google 
>>> suggests that I should upgrade to kernel 2.6.11, but that sounds 
>>> very unpleasant. I am thinking of putting the raid boxes on a 
>>> different box. Will separating the file-server and the head node 
>>> give me back stability on the head node?
>>>
>>> Suvendra.
>>>
>>> _______________________________________________
>>> Beowulf mailing list, Beowulf at beowulf.org
>>> To change your subscription (digest mode or unsubscribe) visit 
>>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>>



