[Beowulf] NFS & Scaling issues
asingh at ideeinc.com
Mon Apr 9 14:13:29 PDT 2007
Thanks for the reply, Joe.
atop looks like a really useful tool, and it will be very helpful once I
get a chance to patch the kernel on the file servers (for process-level
disk and Ethernet usage information). I am setting up a test cluster
to reproduce the problem and will post updates as I find more info.
Joe Landman wrote:
> Hi Amrik:
> Amrik Singh wrote:
>> We are running a cluster of 180 diskless compute nodes. 60 of them
>> have 32-bit AMD Sempron processors and the rest are dual-core AMD
>> Athlon 64-bit processors. The 32-bit machines have 10/100 Mbps and
>> the rest have gigabit Ethernet cards. We have four file servers, each
>> hosting around 3.5 TB on SATA drives connected to 3ware RAID
>> controller cards configured as a RAID 10 array. These file servers
>> are exporting the drives through NFS. Each file server is running 265
>> nfsd daemons. The file servers are mainly hosting a large number of
>> small files ranging from 256 KB to 2 MB. The compute nodes are
>> primarily doing a search through these files, so there is lots of
>> reading and some writing to the file servers.
>> Recently we started noticing very high (70-90%) wait states on the
>> file servers when compute nodes. We have tried to optimize the NFS
>> through increasing the number of daemons and the rsize and wsize but
>> to no avail.
>> Can someone point us in the right direction as to how we should be
>> trying to troubleshoot this problem?
> You might want to look at the read patterns.
>> PS: All the nodes are running SuSE 10.0 and servers are running
>> SuSE10.0 and 10.1 and all the drives are formatted with reiserfs.
> Hmmm... I remember Reiser has had a problem in the past when file
> systems get full or nearly so. There are file tail optimizations you
> might want to turn off, as well as use noatime for mounts. I might
> suggest moving to a better file system for your servers (if possible;
> it might not be a trivial undertaking), but even then the file system
> might not be what is responsible.
> Grab a copy of atop (google for it), run it on your file server. See
> if it is the file system that is problematic (disk devices running
> near 80% or higher capacity for reads/writes all the time).
> Other possibilities are your file access patterns, what the file
> server is doing itself, whether or not your networks are being flooded
> with small packets (see if your csw is very high, or the number of
> interrupts or packets are very high).
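For a quick look at the context-switch and interrupt rates Joe mentions, before atop is installed, something along these lines might do. It is only a minimal Python sketch, assuming a Linux file server where /proc/stat exposes the cumulative "ctxt" and "intr" counters. The function names and the 5-second interval are illustrative choices, not anything from the thread.

```python
import time


def parse_proc_stat(text):
    """Return cumulative (context_switches, interrupts) from /proc/stat text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] in ("ctxt", "intr"):
            # The first number after the key is the running total since boot.
            counters[fields[0]] = int(fields[1])
    return counters["ctxt"], counters["intr"]


def rates(before, after, seconds):
    """Turn two cumulative samples into per-second rates."""
    return tuple((b - a) / seconds for a, b in zip(before, after))


def sample(interval=5.0, path="/proc/stat"):
    """Read the counters twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_proc_stat(f.read())
    return rates(first, second, interval)
```

Calling sample() on a file server while the compute nodes are busy returns (context switches per second, interrupts per second). What counts as "very high" depends on the hardware, so comparing against an idle period on the same machine is probably the most useful baseline.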