diskless nodes? (was Re: Xbox clusters?)
math at velocet.ca
Fri Dec 7 13:04:35 PST 2001
On Fri, Dec 07, 2001 at 09:06:00AM -0500, Carlos O'Donell Jr.'s all...
> > As always YMMV. But quantum chemistry is an application in which disk
> > access can be crucial. I've seen differences in total run time for a job
> > of a factor of 3 between a disk accessed through DMA and non-DMA. If you have to
> > do all scratch space over NFS you will have even lower performance.
> > Groeten, David.
> > ________________________________________________________________________
> > Dr. David van der Spoel, Biomedical center, Dept. of Biochemistry
> > Husargatan 3, Box 576, 75123 Uppsala, Sweden
> > phone: 46 18 471 4205 fax: 46 18 511 755
> > spoel at xray.bmc.uu.se spoel at gromacs.org http://zorn.bmc.uu.se/~spoel
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Have you tested this with NFSv3 on a 100Mbit switched network?
> (With enough BW on the backplane of your switch)
I've had jobs running on only 1, 5, 10, or 20 nodes and they always sit
around 2Mbps. I mean, if I'm getting 98% CPU in user time on the jobs,
I'm pretty happy. That 2% is barely worth my time investigating recovery.
It's when we float around 90% that it might be worth looking at.
Since this cluster was built very efficiently, we actually don't
require switches with a lot of bandwidth: we have a number of 4-port 10/100
cards in the NFS server - and even with the cluster split up so that
there are no more than 6 nodes on each physical network, I see the same
behaviour (but this is just for the types of G98 jobs we run; I am
sure there are many G98 jobs that require thrashing the disk pretty hard).
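
A rough sketch of the arithmetic behind that, assuming the ~2Mbps figure is
per node (the post does not say whether it is per node or in total) and
ignoring protocol overhead. The function names are made up for illustration:

```python
def segments_needed(total_nodes, max_nodes_per_segment=6):
    """Number of physical networks (server ports) needed for the cluster."""
    # Ceiling division without importing math.
    return -(-total_nodes // max_nodes_per_segment)


def segment_utilization(nodes_per_segment, per_node_mbps, link_mbps=100.0):
    """Fraction of one shared link used by the nodes attached to it.

    Assumption: per_node_mbps is the average NFS traffic of a single node.
    """
    return nodes_per_segment * per_node_mbps / link_mbps


if __name__ == "__main__":
    nodes = 20
    print("segments needed:", segments_needed(nodes))
    print("utilization per segment:", segment_utilization(6, 2.0))
```

Under those assumptions, six nodes load a 100Mbit segment to about 12%, which
would be consistent with not needing a high-bandwidth switch for this workload.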
> I'd be interested in seeing your results, as the network
> may offer better latency than the disk (depends on disk and network
The other thing is that our NFS server has a lot of RAM, so if something's
not cached locally but has been accessed a lot recently, it
may still be in the cache there.
> I would assume also that since you are doing local disk writes
> that the information is "scratch-pad" in nature?
In this case we're writing scratch files with G98 since there's
really no way around it. However, because of all its very frequent small
writes and reads, we've been using a swap-backed memory disk.
That was a bit of a trick to configure, but it's really sweet. We have far more
RAM on the nodes than the calculations require, and we give them a 20 gig
swap-backed memory disk for scratch. Its size in RAM is usually around 384 or 512
megs depending on various things, but this takes care of what seems to be a
large number of the small frequent file writes. (Regular caching for NFSv3 can
do this as well, if you have your locking set up correctly, but it doesn't seem
to have quite as high a hit rate.) Perhaps this is the reason I find non-local
disk so acceptable. :) (But really, a default configuration for NFSv3
using the local cache should help you out a fair bit with frequent small
writes and rereading the same file just written.)
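
To compare scratch locations for this access pattern (many small writes, each
read back right away), something like the following could be pointed at a
memory-disk mount and at an NFS mount in turn. This is only a sketch: the
function name, file count, and file size are assumptions, not anything G98
does, and the reads will usually be served from cache, which is the effect
being discussed anyway.

```python
import os
import tempfile
import time


def small_io_benchmark(scratch_dir, n_files=200, size_bytes=4096):
    """Write small files, read each back immediately, delete it.

    Returns elapsed wall-clock seconds for the whole loop.
    """
    payload = os.urandom(size_bytes)
    start = time.perf_counter()
    for i in range(n_files):
        path = os.path.join(scratch_dir, f"scratch_{i:06d}.tmp")
        with open(path, "wb") as fh:
            fh.write(payload)
        with open(path, "rb") as fh:
            data = fh.read()
        if data != payload:
            raise IOError(f"read-back mismatch for {path}")
        os.remove(path)
    return time.perf_counter() - start


if __name__ == "__main__":
    # A temporary directory stands in for the real scratch mount point.
    with tempfile.TemporaryDirectory() as scratch:
        elapsed = small_io_benchmark(scratch)
        print(f"{elapsed:.4f} s for 200 small write/read cycles")
```

Running it with `scratch_dir` set to each candidate mount gives a crude
side-by-side number; it says nothing about large sequential I/O.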
> Just pondering again the issues of latency:
> 1. Does a poorly written driver and or NIC hardware produce
> that much higher latency? (/me looks at 3C59x based card in box)
On 1000Mbit I can see it making a very big difference. On 100Mbit cards
it really depends on how big your frame sizes are as well as DMA access,
buffers and other factors.
> 2. Does the latency of the disk as a physical device really
> exceed that of a qualitatively _good_ network card on a 100Mbit
> switched topology? (No definition of "good" yet).
Not sure on that - when you are accessing large files, it hardly matters if
it's a 2ms latency or a 20ms latency. For small files you are hopefully using
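
The large-file point can be checked with a simple model in which total time
is one latency plus size divided by bandwidth. The sketch below assumes
100Mbit is roughly 12.5e6 bytes/s with no protocol overhead, and that the
latency is paid once per file; both are simplifications, and the names are
illustrative only.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Seconds to fetch one file: a single latency plus the streaming time."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def latency_share(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the total transfer time that is spent waiting on latency."""
    total = transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


if __name__ == "__main__":
    bandwidth = 12.5e6  # assumed: 100Mbit/s expressed in bytes/s
    for size in (4096, 100_000_000):
        for latency in (0.002, 0.020):
            share = latency_share(size, latency, bandwidth)
            print(f"{size:>11} bytes, {latency * 1000:.0f}ms: "
                  f"{share:.2%} of time is latency")
```

In this model a 100MB file spends well under 1% of its time on latency at
either 2ms or 20ms, while a 4KB file is dominated by latency in both cases.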
> I've heard a lot of good and bad things with regards to:
> 1. Shared network memory use NBD / other technology.
NBD looks interesting. Experimenting with it but nothing in production yet.
> 2. Diskless nodes.
> And both seem to be a factor of:
> Process Related:
> a. How much traffic you already have on the network ?
> b. Are your local disks fast / always on ?
> c. The peak BW you _expect_ out of that tier/level of storage?
> d. The sustained BW you would like at any given point?
> e. Software Maintenance
> - Request/Change Cycle
> f. Hardware Maintenance
> - Break/Replace or New-Install cycles
> Maybe it's time to do some tests and produce pretty graphs.
> Also note that some items in "Administrativa" go away when
> you buy from larger companies that have maintenance
> warranties (concede that you can't know in advance when
> things will break).
Agreed, but we're relatively technical so we can play with stuff.
> Carlos O'Donell Jr.
> University of Western Ontario
> Baldric http://www.baldric.uwo.ca
> Admin / Developer / User
> Beowulf mailing list, Beowulf at beowulf.org
> To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
Ken Chase, math at velocet.ca * Velocet Communications Inc. * Toronto, CANADA