[Beowulf] diskless cluster nfs

Wed Dec 8 06:21:51 PST 2004

On Tue, 7 Dec 2004, Josh Kayse wrote:

> Ok, my first post, so please be gentle.
> 
> I've recently been tasked to build a diskless cluster for one of our
> engineers.  This was easy because we already had an image for the set
> of machines.  Once we started testing, the performance was very poor. 
> Basic setup follows:
> 
> Master node: system drive is 1 36GB SCSI drive
>                      /home raid5 5x 36GB SCSI drives
> Master node exports /tftpboot/192.168.1.x for the nodes.
> 
> all of the nodes are diskless and get their system from the master
> node over gigabit ethernet.
>  All that works fine.
> 
> The engineers use files over nfs for message passing, and no, they
> will not change their code to mpi even though it would be an
> improvement in terms of manageability and probably performance.
> 
> Basically, my question is:  what are some ways of testing the
> performance of nfs and then, how can I improve the performance?
> 
> Thanks for any help in advance.
> 
> PS: nfs mount options: async,rsize=8192,wsize=8192,hard
>        file sizes: approx 2MB

Is this a trick question?  You begin by saying performance is poor.
Then you say that you (they) won't take the obvious step to improve your
performance.  Sigh.

OK, let's start by analyzing the problem.  You haven't said much about
the application.  Testing NFS is a fine idea, but before spending too
much time on any single metric of performance, let's analyze your cluster
and task.

You say you have gigabit ethernet between nodes.  You don't say how MANY
nodes you have, or how fast/what kind they are (even in general terms),
or how much memory they have, or whether they have one or two
processors.  These all matter.

Then there is the application.  If the nodes compute for five minutes,
then write a 2 MB file, then read a 2 MB file (or even several 2 MB
files) parallel scaling is likely to be pretty good, even on top of NFS.
If they compute 0.001 seconds, then write 2 MB and read 2+ MB, parallel
scaling is likely to be poor (NFS or not).  Why?

If you don't already know the answer, you should check out my online
book (http://www.phy.duke.edu/~rgb/Beowulf/beowulf_book.php) and read up
on Amdahl's Law and parallel scaling.  Let's do some estimation.  Forget
NFS.  The theoretical peak bandwidth of gigabit ethernet is 1000/8 = 125
MB/sec (this ignores headers and all sorts of reality).  It takes
(therefore) a minimum of 0.016 seconds to send 2 MB.  In the real world,
bandwidth is generally well under 125 MB/sec for a variety of reasons --
say 100 MB/sec, or 0.02 seconds per 2 MB file.  If you are computing for
only 0.001 seconds and then communicating for 0.04 seconds (one 2 MB
write plus one 2 MB read), parallel scaling will be, um, "poor",
MPI or NFS notwithstanding.
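
The arithmetic above can be sketched in a few lines of Python.  This is
only a back-of-envelope illustration: the 100 MB/sec rate is the assumed
"real world" figure from the estimate, not a measurement, and the
variable names are made up for the example.

```python
# Back-of-envelope transfer times for one 2 MB file over gigabit ethernet.
# All numbers come from the estimate in the text; REAL_MB_PER_S is an
# assumed achievable rate, not a measured one.

FILE_MB = 2.0
PEAK_MB_PER_S = 1000 / 8      # 125 MB/sec theoretical peak
REAL_MB_PER_S = 100.0         # assumed achievable rate
T_COMPUTE = 0.001             # seconds of computation per step (bad case)


def transfer_time(size_mb, rate_mb_per_s):
    """Seconds needed to move size_mb megabytes at rate_mb_per_s."""
    return size_mb / rate_mb_per_s


t_peak = transfer_time(FILE_MB, PEAK_MB_PER_S)   # best case, one way
t_real = transfer_time(FILE_MB, REAL_MB_PER_S)   # assumed realistic, one way
t_comm = 2 * t_real                              # one write plus one read

print(f"peak one-way time:  {t_peak:.3f} s")
print(f"real one-way time:  {t_real:.3f} s")
print(f"write + read:       {t_comm:.3f} s")
print(f"comm/compute ratio: {t_comm / T_COMPUTE:.0f}")
```

Change T_COMPUTE to 300 (five minutes) and the ratio drops far below 1,
which is the whole point: the ratio, not the raw NFS speed, decides how
well the job can scale.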

Once you understand that fundamental ratio, you can determine what the
EXPECTED parallel scaling of the application might be in a good world.
A good world would be one where each node only communicated with
one other node (2 MB each way) AND the communications could proceed in
parallel.  A worse but still tolerable world might be one where the
communications can proceed at least partially in parallel without a
bottleneck -- one-to-many communications require tree algorithms (for
example) or broadcasts to proceed efficiently.

However, you do NOT live in a good world.  You have N hosts engaged
in what sounds like a synchronous computation (where everybody has to
finish a step before going on to the next) with a single communications
master (the NFS server).  Writing to the NFS server and reading from the
NFS server is strictly serialized.  If you have N hosts, it will take at
least Nx0.02 seconds to write all of the output files from a step of
computation, at least Nx0.02 seconds to READ all of the output files
(and that's assuming each node just reads one) and now you've got
something like 0.001 seconds of computation compared to Nx(0.04) or
worse seconds of communication.  The more nodes you add, the slower it
goes!  In fact, if you just KEEP the data on a single node it takes
Nx0.001 seconds to advance the computation a step compared to
0.001+Nx0.04 seconds in the cluster!  Even if you were computing one
second instead of 0.001, this sort of scaling relation will kill
parallel speedup at some number of nodes.
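
Here is a toy model of that scaling relation in Python, as a sketch
rather than a prediction for your cluster.  It assumes every node does
the same t_compute seconds of work per step and that each node's NFS
traffic (t_comm seconds) goes through the single server one node at a
time; the function names are invented for the example.

```python
# Toy model of one step of a synchronous computation on N nodes.
# Assumptions for illustration only: each node does t_compute seconds of
# work per step, and each node's NFS traffic (t_comm seconds) passes
# through the single server strictly one node at a time.


def single_node_step(n_nodes, t_compute):
    """All N nodes' work done on one machine, with no communication."""
    return n_nodes * t_compute


def nfs_cluster_step(n_nodes, t_compute, t_comm):
    """Nodes compute in parallel, then queue up at the NFS server."""
    return t_compute + n_nodes * t_comm


def speedup(n_nodes, t_compute, t_comm):
    """Single-node time divided by cluster time; below 1 is a slowdown."""
    return single_node_step(n_nodes, t_compute) / nfs_cluster_step(
        n_nodes, t_compute, t_comm
    )


if __name__ == "__main__":
    T_COMM = 0.04  # seconds per node per step (2 MB written + 2 MB read)
    for t_compute in (0.001, 1.0):
        print(f"t_compute = {t_compute} s")
        for n in (1, 2, 4, 8, 16, 64, 256):
            print(f"  N = {n:3d}  speedup = {speedup(n, t_compute, T_COMM):7.3f}")
```

With t_compute = 0.001 the "speedup" is below 1 at every N, i.e. the
cluster is always slower than one machine.  With t_compute = 1.0 the
speedup grows at first but can never exceed t_compute/t_comm = 25 in
this model, no matter how many nodes you buy.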

Note that I've gone into some detail here, because you are going to have
to explain this, in some detail, to your engineers after working out the
parallel scaling for the task at hand.  There is no way out of this or
around this.  Tuning the hell out of NFS is going to yield at most a
factor of 2 or so in speedup of the communications phase, and your
problem is probably a SCALING relation and bottlenecked COMMUNICATIONS
PATTERN that couldn't care less about factors of 2 and is intrinsic to
serialized NFS.  In other words, your engineers are either going to have
to accept algebraically derived reality and restructure their code so it
speeds up (almost certainly abandoning the NFS communications model) or
accept that it not only won't speed up, it will actually slow down when
run in parallel on your cluster.

Engineers tend to be pretty smart, especially about the constraints
imposed by unforgiving nature.  If you give them a short presentation
and teach them about parallel scaling, they'll probably understand that
it isn't a matter of tweaking something and NFS working, it is a
fundamental mathematical relation that keeps NFS from EVER working in
this context as an IPC channel.

Unless, of course, you compute for minutes and communicate for only
seconds, in which case your speedup should be fine.  That's why you have
to analyze the problem itself and learn the details of its work pattern
BEFORE designing a cluster or parallelization.
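
One way to put a number on "fine" is to ask how many nodes you can add
before parallel efficiency drops below some threshold.  The sketch below
reuses the serialized model from the 0.001+Nx0.04 estimate earlier; the
90% threshold and the function names are arbitrary choices for
illustration, not anything measured on a real cluster.

```python
import math


def efficiency(n_nodes, t_compute, t_comm):
    """Fraction of ideal N-fold speedup under the serialized-NFS model.

    Speedup is N*t_compute / (t_compute + N*t_comm); dividing by N gives
    the expression below.
    """
    return t_compute / (t_compute + n_nodes * t_comm)


def max_nodes(t_compute, t_comm, min_efficiency):
    """Largest N whose efficiency is still at least min_efficiency.

    Solves t_compute / (t_compute + N*t_comm) >= min_efficiency for N.
    Floating-point rounding could shift the result by one when the bound
    lands almost exactly on an integer.
    """
    bound = t_compute * (1 - min_efficiency) / (min_efficiency * t_comm)
    return math.floor(bound)


T_COMM = 0.04  # seconds per node per step, from the estimate in the text

# Five minutes of computing per step: NFS overhead barely matters.
print("compute 300 s:  ", max_nodes(300.0, T_COMM, 0.9), "nodes")

# One millisecond of computing per step: no node count qualifies.
print("compute 0.001 s:", max_nodes(0.001, T_COMM, 0.9), "nodes")
```

Under these assumptions a five-minute compute step tolerates hundreds of
nodes at 90% efficiency, while a one-millisecond step tolerates none.
The inputs you actually need are the measured compute and communication
times per step for the real application.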

   rgb

> -- 
> Joshua Kayse
> Computer Engineering

Robert G. Brown	                       http://www.phy.duke.edu/~rgb/
Duke University Dept. of Physics, Box 90305
Durham, N.C. 27708-0305
Phone: 1-919-660-2567  Fax: 919-660-2525     email:rgb at phy.duke.edu