[Beowulf] NFS+XFS+SMP on kernel 2.6
Robert G. Brown rgb at phy.duke.edu
Wed Jun 15 09:16:30 PDT 2005
Suvendra Nath Dutta writes:

> We set up a 160 node cluster with a dual processor head node with 2GB
> RAM. The head node also has two RAID devices attached to two SCSI
> cards. These have a XFS filesystem on them and are NFS exported to the
> cluster. The head node runs very low on memory (7-8 MB). And today I
> ran into a kernel bug that crashed the system. Google suggests that I
> should upgrade to kernel 2.6.11, but that sounds very unpleasant. I am

Why? FC3 and now FC4 are running 2.6.11 as their current kernel. If
you are using a sound distribution with an update stream and yum to do
your maintenance, moving to 2.6.11 should require you to do exactly --
nothing. Well, reboot the nodes and server after yum does an automated
update to the new kernel, or do a yum update kernel by hand if you
(sensibly enough) prevent it from updating kernels on your servers
except by hand, so you have a chance to test the updates on a
prototyping server on the side first.

If you base your cluster on a sound rpm distro, yum will do pretty much
all of this sort of thing for you, with absolutely minimal personal
effort. I'm pretty sure that Centos/RHEL 4.x also uses 2.6.11 in its
current update stream, since I understand that it "is" FC4 but frozen
except for passback updates.

Another possibility, of course, is adding memory to just the server.
Expensive, but the cost is paid once rather than multiplied over 160
nodes. Depending on your motherboard, you can probably use 1 GB DIMMs,
in which case it isn't even that expensive.

> thinking of putting the raid boxes on a different box. Will separating
> the file-server and the head node give me back stability on the head
> node?

What do you mean by "head node" vs "file server"? To put the question
another way, exactly what sort of load is being placed on the
server/head node by the two different functions? This question has to
be answered for YOUR particular usage pattern before you can answer
your own question.
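One way to start putting numbers on that load is to sample the kernel's CPU counters twice and see how the interval splits between real work, waiting on disk I/O, and idling. What follows is only a minimal sketch, not anything from the original poster's setup: the function names are made up for illustration, and it assumes the aggregate "cpu" line of /proc/stat has its usual field order (user, nice, system, idle, iowait, ...).

```python
# Sketch: estimate CPU busy / iowait / idle shares from two /proc/stat samples.
# All function names are hypothetical helpers, not part of any library.

def parse_cpu_line(stat_text):
    """Return the jiffy counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # Keep at most user..steal; later fields (guest) overlap with user.
            return [int(x) for x in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def cpu_shares(before_text, after_text):
    """Fraction of the interval spent busy, in iowait, and idle."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [new - old for old, new in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    idle = deltas[3]
    iowait = deltas[4] if len(deltas) > 4 else 0
    busy = total - idle - iowait
    return {"busy": busy / total, "iowait": iowait / total, "idle": idle / total}


def read_proc_stat(path="/proc/stat"):
    """Read one sample on a Linux host (not called automatically)."""
    with open(path) as handle:
        return handle.read()


# Typical use on the server, with a pause between the two samples:
#   first = read_proc_stat(); time.sleep(10); second = read_proc_stat()
#   print(cpu_shares(first, second))
```

A large iowait share while jobs are running points at the disk side of the server; a large busy share with little iowait points at whatever else the head node is being asked to do. Neither number settles the question alone, but it is a cheap first measurement.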
For example, suppose your users connect to the head node infrequently
to put a batch job in the queue, and those jobs typically run for a
long time before finishing, and produce results that the user can then
offload to their local workstation with a 1 second remote copy. This
load is pretty much negligible, and splitting it off won't gain you
much of anything.

OTOH, suppose your users connect to the head node and run jobs that
support a real-time visualization socket back to their desktop
workstation that pumps all sorts of data in and out of the nodes on the
same network channel as the file store and adds insult to injury by
then REpumping it, somewhat digested on the head node, out to the
remote workstation. This is now a (possibly) significant CPU and
network load, and splitting it off might well help.

Similar usage patterns might emerge associated with the file store
itself -- some patterns produce very little server load, so splitting
it off is a waste of time; others might saturate the server's capacity,
and splitting it off WOULD help.

Let's think very briefly about the two "trouble" cases. Suppose that
your disk IS the culprit -- users are running jobs on all 160 nodes
that are constantly reading from and writing to disk. In that case
"just" splitting off the RAID might not be enough -- you might have to
REARCHITECT the RAID to get greater parallelism, or put it on a better
network, e.g. IB instead of ethernet. Just splitting it off might
stabilize your head node per se (so it doesn't crash) but leave you
with a disk server that crashes and jobs that hang or are delayed by
disk access bottlenecks.

Ditto analyzing your "head node" utilization at the application level
-- the right solution might be task reorganization instead of just
buying new hardware. At the very least you'll have to do some serious
work to be able to KNOW that what you end up doing will actually solve
the problem.

HTH,

   rgb

> Suvendra.
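The "disk IS the culprit" case above can be sanity-checked with back-of-the-envelope arithmetic before any hardware is bought. The sketch below is purely illustrative: the gigabit server link and the 1 MB/s per-node rate are assumptions chosen for the example, not figures from the original post, and the function names are hypothetical.

```python
# Sketch: does aggregate node I/O demand exceed the server's network link?
# The link speed and per-node rate used below are illustrative assumptions.

def link_utilization(n_nodes, per_node_mb_s, link_mbit_s):
    """Aggregate demand as a fraction of the link's raw capacity."""
    link_mb_s = link_mbit_s / 8.0  # megabits/s -> megabytes/s, no protocol overhead
    return n_nodes * per_node_mb_s / link_mb_s


def max_per_node_rate(n_nodes, link_mbit_s):
    """Highest sustained per-node rate (MB/s) the raw link could carry."""
    return (link_mbit_s / 8.0) / n_nodes


# 160 nodes each averaging 1 MB/s against an assumed 1000 Mbit/s server link.
demand = link_utilization(160, 1.0, 1000)
ceiling = max_per_node_rate(160, 1000)
print(f"demand is {demand:.2f}x the raw link capacity")
print(f"each node can sustain at most {ceiling:.3f} MB/s")
```

Under these assumed numbers the demand exceeds what the raw link can carry, so moving the RAID to another box on the same kind of link would only move the bottleneck. Real NFS throughput will sit below the raw figure, so measured rates should replace the assumed ones before drawing conclusions.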
