[Beowulf] cluster storage design
alvin at ns.Linux-Consulting.com
Wed Mar 23 16:36:22 PST 2005
On Wed, Mar 23, 2005 at 09:41:46AM -0600, Brian Henerey wrote:
> I have a 32 node cluster with 1 master and 1 data storage server with 1.5
> TB's of storage. The master used to have storage:/home mounted on /home via
> NFS. I moved the 1.5TB RAID array of storage so it was directly on the
> master. This decreased the time it took for our program to run by a factor
> of 4.
yes .. that is a good thing
> I read somewhere that mounting the data to the master via NFS was a
> bad idea for performance, but am not sure what the best alternative is. I
> don't want to have to move data on/off the master each time I run a job
> because this will slow it down as more people are using it.
for users, you have 2 choices ??
/home on one big "home server"
or automagically sync users' login IDs and passwords from node to node
( little more work.. but not as bad as it sounds )
if the "home server" dies ... everybody is dead
if each node is standalone .. there are no issues with the "master" dying
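a rough sketch of the "sync logins from node to node" option, in Python. everything here is an assumption for illustration: the function name merge_accounts, the uid cutoffs ( 500 and 60000 ), and the idea of merging passwd text rather than using NIS or a plain rsync of the files. it only handles the 7-field /etc/passwd format; /etc/shadow would need the same treatment with its own field layout.

```python
def merge_accounts(master_passwd, node_passwd, min_uid=500, max_uid=60000):
    """Return passwd text for a node: the node's own system accounts
    plus the regular user accounts taken from the master copy.

    min_uid / max_uid are assumed cutoffs for "regular user" uids.
    """
    def parse(text):
        entries = []
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            fields = line.split(":")
            if len(fields) != 7:      # skip anything that is not a passwd entry
                continue
            entries.append(fields)
        return entries

    def is_user(fields):
        return min_uid <= int(fields[2]) < max_uid

    master_users = [f for f in parse(master_passwd) if is_user(f)]
    master_names = {f[0] for f in master_users}

    # keep the node's system accounts, drop its old copies of user accounts
    kept = [f for f in parse(node_passwd)
            if not is_user(f) and f[0] not in master_names]

    merged = kept + sorted(master_users, key=lambda f: int(f[2]))
    return "\n".join(":".join(f) for f in merged) + "\n"
```

the result would be written out on each node ( from cron, say ), so a dead "master" only stops new account changes from propagating, not logins.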
for running jobs ....
an automated queue is good ... users don't necessarily dictate
which nodes to run the jobs on, but a good queuer will allow
users to specify preferences
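to make the "preferences, not dictation" idea concrete, here is a toy sketch in Python. it is not how any particular queuing system works; the function name schedule and the job/node names are made up for the example. jobs are taken in submission order, each gets its first preferred node that is idle, and falls back to any idle node otherwise.

```python
def schedule(jobs, free_nodes):
    """Assign queued jobs to idle nodes, honoring preferences when possible.

    jobs:       list of (job_name, preferred_nodes) in submission order
    free_nodes: iterable of idle node names
    Returns (assignments, waiting): a dict job_name -> node, and the
    list of job names that could not be placed yet.
    """
    free = list(free_nodes)
    assignments = {}
    waiting = []
    for name, preferred in jobs:
        if not free:
            waiting.append(name)      # nothing idle, job stays queued
            continue
        # first preferred node that is idle, otherwise any idle node
        choice = next((n for n in preferred if n in free), free[0])
        free.remove(choice)
        assignments[name] = choice
    return assignments, waiting


if __name__ == "__main__":
    jobs = [("a", ["n2"]), ("b", ["n2", "n1"]), ("c", []), ("d", ["n1"])]
    print(schedule(jobs, ["n1", "n2", "n3"]))
```

a real queuer would also look at load, memory, and where the job's data lives, but the shape is the same: the user hints, the queue decides.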
for "/data" where all nodes share a common big 100TB data farm ..
- you have NFS or SANs or ??
- getting good nic cards and good switches helps a lot
- change your NFS parameters ( the rsize and wsize mount options ) to send
16K or 32K bytes at a time instead of 512 bytes
- dual or quad channel bonding should help with throughput too..
- a TB sized "/data" shouldn't be noticeably slow across the nodes
- /data should be on the machine where the apps use it the most
- since /data is probably shared across multiple nodes, it might
be worth it ( definitely worth it ) to buy another 4 or 8 disks
and use them as backups of /data on other nodes
- you now have 3 "master nodes" with local /data
- you will have to rsync and rdiff your changes from
node to node
- 1 TB of disks is about $600 nowadays ( 4x $150 each )
- structuring your /data into /data/xxx and /data/yyy and /data/zzz
will allow each node to keep the data it works on local, so that
its disk i/o is done on its own disks as opposed to across the
slow ethernet
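the last few bullets ( split /data into subtrees, keep each one local to the node that uses it, rsync copies to other nodes ) can be sketched in Python as below. the node names, the LAYOUT table and both function names are hypothetical; the only real tool assumed is rsync with its -a and --delete options. the code just builds the commands, it does not run them.

```python
# Hypothetical layout: which node holds the primary copy of each /data
# subtree, and which nodes keep an rsync'd backup copy of it.
LAYOUT = {
    "/data/xxx": {"primary": "node01", "backups": ["node02", "node03"]},
    "/data/yyy": {"primary": "node02", "backups": ["node03", "node01"]},
    "/data/zzz": {"primary": "node03", "backups": ["node01", "node02"]},
}


def launch_node(path, layout=LAYOUT):
    """Return the node whose local disks hold `path`, i.e. where a job
    reading that path should be started to avoid going over the network."""
    for subtree, info in layout.items():
        if path == subtree or path.startswith(subtree + "/"):
            return info["primary"]
    raise KeyError("no node holds " + path)


def mirror_commands(layout=LAYOUT):
    """Build (primary_node, rsync_argv) pairs; each argv is meant to be
    run on the primary node to push its subtree to a backup node."""
    cmds = []
    for subtree, info in sorted(layout.items()):
        for backup in info["backups"]:
            argv = ["rsync", "-a", "--delete",
                    subtree + "/", backup + ":" + subtree + "/"]
            cmds.append((info["primary"], argv))
    return cmds


if __name__ == "__main__":
    print(launch_node("/data/yyy/run42/input.dat"))
    for node, argv in mirror_commands():
        print(node, " ".join(argv))
```

with a table like this, a job that reads /data/yyy gets launched on node02, and the backups on node03 and node01 are only as stale as the last rsync run.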
> I know there are probably many solutions but I'm curious what the people on
> this list do. It seems to me that SAN's are very expensive compared to just
> building servers with 4 x 500GB hard drives. I've considered just launching
> my lam-mpi jobs from whatever storage server has the appropriate data on it,
> but this doesn't seem ideal.
for me ... lots of redundant IDE disks are way way better/faster than SAN/NAS
> How does performance compare from having the data local on the master via
> running it off a PVFS?