[Beowulf] NFS & Scaling issues

Amrik Singh asingh at ideeinc.com
Mon Apr 9 14:13:29 PDT 2007

Thanks for the reply Joe....

atop seems to be a really cool tool that would be very helpful once I 
get a chance to patch the kernel on the file servers (for process-level 
disk and Ethernet usage information). I am setting up a test cluster 
to reproduce the problem and will post updates as I find more info...


Joe Landman wrote:
> Hi Amrik:
> Amrik Singh wrote:
>> Hi,
>> We are running a cluster of 180 diskless compute nodes. 60 of them 
>> have 32-bit AMD Sempron processors and the rest are dual-core AMD 
>> Athlon 64-bit processors. The 32-bit machines have 10/100 Mbps and the 
>> rest have gigabit Ethernet cards. We have four file servers, each hosting 
>> around 3.5TB on SATA drives connected to 3Ware RAID controller cards 
>> configured as a RAID 10 array. These file servers are exporting the 
>> drives through NFS. Each file server is running 265 daemons for nfsd.
>> The file servers are mainly hosting a large number of small files 
>> ranging from 256KB to 2 MB. The compute nodes are primarily doing a 
>> search through these files, so there is lots of reading and some 
>> writing to the file servers.
>> Recently we started noticing very high (70-90%) wait states on the 
>> file servers when compute nodes. We have tried to optimize the NFS 
>> through increasing the number of daemons and the rsize and wsize but 
>> to no avail.
>> Can someone point us in the right direction as to how we should be 
>> trying to troubleshoot this problem?
> You might want to look at the read patterns.
>> PS: All the nodes are running SuSE 10.0, the servers are running 
>> SuSE 10.0 and 10.1, and all the drives are formatted with reiserfs.
> Hmmm... I remember Reiser has had a problem in the past when file 
> systems get full or nearly so.  There are file tail optimizations you 
> might want to turn off, as well as use noatime for mounts.  I might 
> suggest turning to a better file system for your servers (if possible, 
> it might not be a trivial undertaking), but even then that might not 
> be responsible.
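
A minimal sketch of how one might check the mount options mentioned above. It assumes the "file tail optimizations" correspond to the reiserfs `notail` mount option and that the input is text in the `/proc/mounts` format; the function name `mounts_missing_options` is hypothetical, not part of any tool.

```python
def mounts_missing_options(mounts_text, fstype="reiserfs",
                           wanted=("noatime", "notail")):
    """Return {mount_point: [missing options]} for mounts of the given type.

    mounts_text is expected in /proc/mounts format:
    device mount_point fstype options dump pass
    """
    missing = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != fstype:
            continue  # skip malformed lines and other filesystem types
        options = set(fields[3].split(","))
        lacking = [opt for opt in wanted if opt not in options]
        if lacking:
            missing[fields[1]] = lacking
    return missing


def report(path="/proc/mounts"):
    """Print every reiserfs mount that lacks noatime or notail."""
    with open(path) as f:
        for mount_point, lacking in mounts_missing_options(f.read()).items():
            print(mount_point, "is missing:", ", ".join(lacking))
```

Calling `report()` on a file server lists the exports that would need a remount (or an /etc/fstab change) before the options take effect.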
> Grab a copy of atop (google for it), run it on your file server.  See 
> if it is the file system that is problematic (disk devices running 
> near 80% or higher capacity for reads/writes all the time).
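
Until atop is installed, a rough approximation of the same check can be made from `/proc/diskstats`. The sketch below assumes that "near 80% capacity" means the fraction of wall-clock time a device spent doing I/O, taken from the "milliseconds spent doing I/O" column; the function names and the 5-second interval are illustrative choices, not atop's own method.

```python
import time


def parse_io_ms(diskstats_text):
    """Map device name -> cumulative milliseconds spent doing I/O.

    Whole-disk lines in /proc/diskstats have major, minor, name and at
    least 11 counters; the tenth counter (index 12) is time spent doing I/O.
    Shorter lines (e.g. old-style partition entries) are skipped.
    """
    busy = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 14:
            busy[fields[2]] = int(fields[12])
    return busy


def utilization(before, after, interval_s):
    """Percent of the interval each device was busy, from two samples."""
    return {
        dev: 100.0 * (after[dev] - before[dev]) / (interval_s * 1000.0)
        for dev in before
        if dev in after
    }


def read_diskstats():
    with open("/proc/diskstats") as f:
        return f.read()


def main(interval_s=5.0, threshold=80.0):
    """Sample twice and flag devices at or above the threshold."""
    first = parse_io_ms(read_diskstats())
    time.sleep(interval_s)
    second = parse_io_ms(read_diskstats())
    for dev, pct in sorted(utilization(first, second, interval_s).items()):
        flag = "  <-- busy" if pct >= threshold else ""
        print(f"{dev:10s} {pct:5.1f}%{flag}")
```

Running `main()` repeatedly while the compute nodes are searching should show whether the RAID devices stay pinned near the threshold or only spike occasionally.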
> Other possibilities are your file access patterns, what the file 
> server is doing itself, whether or not your networks are being flooded 
> with small packets (see if your csw is very high, or the number of 
> interrupts or packets are very high).
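
The context-switch and interrupt rates can also be read without extra tools. This is a sketch only, assuming "csw" refers to the `ctxt` counter in `/proc/stat` and that the first number on the `intr` line is the total interrupt count; what counts as "very high" depends on the hardware and is not defined here.

```python
import time


def parse_counters(stat_text):
    """Extract cumulative context-switch and interrupt totals.

    In /proc/stat, 'ctxt N' is the total number of context switches and
    the first number after 'intr' is the total number of interrupts.
    """
    counters = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("ctxt", "intr"):
            counters[fields[0]] = int(fields[1])
    return counters


def per_second(before, after, interval_s):
    """Convert two cumulative samples into per-second rates."""
    return {
        key: (after[key] - before[key]) / interval_s
        for key in before
        if key in after
    }


def main(interval_s=5.0):
    """Sample /proc/stat twice and print the rates."""
    with open("/proc/stat") as f:
        first = parse_counters(f.read())
    time.sleep(interval_s)
    with open("/proc/stat") as f:
        second = parse_counters(f.read())
    for name, rate in sorted(per_second(first, second, interval_s).items()):
        print(f"{name}: {rate:.0f}/s")
```

Comparing the output of `main()` on an idle file server with the output under load gives a baseline for judging whether small-packet traffic is driving the interrupt rate up.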
> Joe
>> thanks
