[Beowulf] NFS+XFS+SMP on kernel 2.6

Suvendra Nath Dutta sdutta at cfa.harvard.edu
Wed Jun 15 08:35:59 PDT 2005


From /var/log/messages:


Jun 14 16:39:48 sauron kernel: ----------- [cut here ] --------- 
[please bite here ] ---------
Jun 14 16:39:48 sauron kernel: Kernel BUG at debug:106
Jun 14 16:39:48 sauron kernel: invalid operand: 0000 [1] SMP
Jun 14 16:39:48 sauron kernel: CPU 1
Jun 14 16:39:48 sauron kernel: Modules linked in: e1000 tg3 subfs dm_mod
Jun 14 16:39:48 sauron kernel: Pid: 10070, comm: nfsd Not tainted 
2.6.8.1-suse91-osmp
Jun 14 16:39:48 sauron kernel: RIP: 0010:[cmn_err+278/299] 
<ffffffff802c9456>{cmn_err+278}
Jun 14 16:39:48 sauron kernel: RIP: 0010:[<ffffffff802c9456>] 
<ffffffff802c9456>{cmn_err+278}
Jun 14 16:39:48 sauron kernel: RSP: 0018:00000100791d17b8  EFLAGS: 
00010246
Jun 14 16:39:48 sauron kernel: RAX: 0000000000000050 RBX: 
0000000000000000 RCX: ffffffff805b4ae8
Jun 14 16:39:48 sauron kernel: RDX: ffffffff805b4ae8 RSI: 
0000000000000001 RDI: 000001006e6aab30
Jun 14 16:39:48 sauron kernel: RBP: 0000010033f47ac0 R08: 
0000000000000001 R09: 0000000000000001
Jun 14 16:39:50 sauron kernel: R10: 0000000000000000 R11: 
0000000000000000 R12: 0000010033f47af0
Jun 14 16:39:50 sauron kernel: R13: 0000000098ee8d60 R14: 
000001007e169000 R15: 000001007cf53a38
Jun 14 16:39:50 sauron kernel: FS:  0000002a9588d6e0(0000) 
GS:ffffffff806f5040(0000) knlGS:0000000062693bb0
Jun 14 16:39:50 sauron kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
000000008005003b
Jun 14 16:39:51 sauron kernel: CR2: 0000002a9558c000 CR3: 
0000000037eca000 CR4: 00000000000006e0
Jun 14 16:39:51 sauron kernel: Process nfsd (pid: 10070, threadinfo 
00000100791d0000, task 000001006e6aab30)
Jun 14 16:39:51 sauron kernel: Stack: 0000000000000001 0000000000000293 
0000003000000020 00000100791d18a8
Jun 14 16:39:51 sauron kernel:        00000100791d17e8 ffffffff80153b08 
0000000000001000 ffffffff8017677a
Jun 14 16:39:51 sauron kernel:        0000010078a8d080 0000010033f47ac0
Jun 14 16:39:51 sauron kernel: Call 
Trace:<ffffffff80153b08>{find_get_page+24} 
<ffffffff8017677a>{__find_get_block_slow+74}
Jun 14 16:39:51 sauron kernel:        <ffffffff802c8ef8>{vn_purge+328} 
<ffffffff80177e98>{unmap_underlying_metadata+8}
Jun 14 16:39:51 sauron kernel:        
<ffffffff802c7c99>{linvfs_alloc_inode+41} 
<ffffffff8018e6a6>{iget_locked+230}
Jun 14 16:39:51 sauron kernel:        
<ffffffff802c91ec>{vn_initialize+124} <ffffffff802a02b6>{xfs_iget+358}
Jun 14 16:39:51 sauron kernel:        <ffffffff802c8fe4>{vn_remove+68} 
<ffffffff802b6b73>{xfs_vget+51}
Jun 14 16:39:51 sauron kernel:        <ffffffff802c87d8>{vfs_vget+40} 
<ffffffff802a9e41>{xlog_write+1057}
Jun 14 16:39:51 sauron kernel:        
<ffffffff802c77eb>{linvfs_get_dentry+59} 
<ffffffff802186f0>{find_exported_dentry+64}
Jun 14 16:39:51 sauron kernel:        
<ffffffff8021bdf0>{nfsd_acceptable+0} 
<ffffffff8047b011>{sock_alloc_send_pskb+113}
Jun 14 16:39:51 sauron kernel:        
<ffffffff80491b88>{rt_hash_code+56} 
<ffffffff80493c10>{__ip_route_output_key+48}
Jun 14 16:39:51 sauron kernel:        
<ffffffff804819fd>{netif_receive_skb+381} 
<ffffffffa0013327>{:tg3:tg3_enable_ints+23}
Jun 14 16:39:51 sauron kernel:        
<ffffffff8049a319>{ip_append_data+809} 
<ffffffff8048f783>{qdisc_restart+35}
Jun 14 16:39:51 sauron kernel:        
<ffffffff8022084e>{exp_find_key+126} 
<ffffffff80218d7b>{export_decode_fh+123}
Jun 14 16:39:51 sauron kernel:        <ffffffff8021bc31>{fh_verify+961} 
<ffffffff80135230>{autoremove_wake_function+0}
Jun 14 16:39:51 sauron kernel:        
<ffffffff80135230>{autoremove_wake_function+0} 
<ffffffff8021d6d8>{nfsd_open+56}
Jun 14 16:39:51 sauron kernel:        
<ffffffff8021da3b>{nfsd_write+107} 
<ffffffff8036e63f>{scsi_end_request+223}
Jun 14 16:39:51 sauron kernel:        
<ffffffff8036e84c>{scsi_io_completion+492} 
<ffffffff8015b99e>{cache_flusharray+110}
Jun 14 16:39:51 sauron kernel:        
<ffffffff80504bd2>{ip_map_lookup+306} 
<ffffffff805053a5>{svcauth_unix_accept+597}
Jun 14 16:39:51 sauron kernel:        
<ffffffff802252d1>{nfsd3_proc_write+241} 
<ffffffff80218f60>{nfsd_dispatch+256}
Jun 14 16:39:51 sauron kernel:        
<ffffffff80501123>{svc_process+947} <ffffffff80219220>{nfsd+0}
Jun 14 16:39:51 sauron kernel:        <ffffffff80219465>{nfsd+581} 
<ffffffff801332ee>{schedule_tail+14}
Jun 14 16:39:51 sauron kernel:        <ffffffff801102a7>{child_rip+8} 
<ffffffff80219220>{nfsd+0}
Jun 14 16:39:51 sauron kernel:        <ffffffff80219220>{nfsd+0} 
<ffffffff8011029f>{child_rip+0}
Jun 14 16:39:51 sauron kernel:
Jun 14 16:39:51 sauron kernel:
Jun 14 16:39:51 sauron kernel: Code: 0f 0b cc 63 53 80 ff ff ff ff 6a 
00 48 81 c4 e0 00 00 00 5b
Jun 14 16:39:51 sauron kernel: RIP <ffffffff802c9456>{cmn_err+278} RSP 
<00000100791d17b8>
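
The trace above is hard to read because the mail client wrapped it. A minimal sketch, assuming Python is at hand, that pulls the address, module, symbol and offset out of each <address>{symbol+offset} entry regardless of where the lines were broken. The names FRAME_RE, parse_frames and SAMPLE are made up for this illustration, and SAMPLE is just a few lines copied from the log. Note that on this kernel the trace may include stale addresses left on the stack, so the list is a hint rather than an exact call chain.

```python
import re

# One trace entry looks like <ffffffff802a02b6>{xfs_iget+358}; entries that
# live in a loadable module carry a prefix, e.g. {:tg3:tg3_enable_ints+23}.
FRAME_RE = re.compile(r"<([0-9a-f]{16})>\{(?::(\w+):)?(\w+)\+(\d+)\}")


def parse_frames(log_text):
    """Return a list of (address, module, symbol, offset) tuples.

    Entries without a module prefix are reported with module "kernel".
    Line wrapping does not matter because each entry is matched on its own.
    """
    frames = []
    for addr, module, symbol, offset in FRAME_RE.findall(log_text):
        frames.append((addr, module or "kernel", symbol, int(offset)))
    return frames


# A few lines copied from the log above (illustrative input only).
SAMPLE = """\
Jun 14 16:39:51 sauron kernel:
<ffffffff802c91ec>{vn_initialize+124} <ffffffff802a02b6>{xfs_iget+358}
Jun 14 16:39:51 sauron kernel:
<ffffffff804819fd>{netif_receive_skb+381}
<ffffffffa0013327>{:tg3:tg3_enable_ints+23}
"""

if __name__ == "__main__":
    for addr, module, symbol, offset in parse_frames(SAMPLE):
        print(f"{addr}  {module:8s} {symbol}+{offset}")
```

Feeding it the whole of /var/log/messages instead of SAMPLE should work the same way, since lines without a matching entry are simply ignored.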

On Jun 15, 2005, at 10:57 AM, Paul Nowoczynski wrote:

> What kernel bug did you run into?  Was it a page allocation failure?
> paul
>
> Suvendra Nath Dutta wrote:
>
>> We set up a 160-node cluster with a dual-processor head node with 2 GB 
>> of RAM. The head node also has two RAID devices attached to two SCSI 
>> cards. These have an XFS filesystem on them and are NFS-exported to 
>> the cluster. The head node runs very low on free memory (7-8 MB), and 
>> today I ran into a kernel bug that crashed the system. Google 
>> suggests that I should upgrade to kernel 2.6.11, but that sounds very 
>> unpleasant. I am thinking of moving the RAID boxes to a different 
>> machine. Will separating the file server from the head node give me 
>> back stability on the head node?
>>
>> Suvendra.
>



