[Beowulf] network filesystem

Tue Mar 6 09:00:24 PST 2007

Mark Hahn wrote:
>>> writing to different sections of a file is probably wrong on any
>>> networked FS, since there will inherently be obscure interactions
>>> with the size and alignment of the writes vs client pagecache,
>>
>> I'm rather surprised to see that sentiment on a mailing list for high
>> performance clusters :>
>
> smiley noted, but I would suggest that HPC is not about convenience
> first - simply having each node write to a separate file eliminates
> any such issue, and is hardly an egregious complication to the code.

Actually, this can greatly complicate the code. If I run a CFD case on n
processes and each one writes its part of the solution to a separate file,
then how do I read those n files if I later run on 1.5*n processes? I can
write some code to take the n files and write out a single file, or 1.5*n
files, for instance. To me this is a wasteful use of cycles when something
like MPI-IO is so much better and lets me stick with a single file.
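
To make the single-file argument concrete, here is a minimal Python sketch.
It is not taken from any CFD code: the helper name block_range, the problem
size, and the file name are all made up for illustration, and plain loops
stand in for the parallel ranks. The point is only that each rank's slice of
one shared file is computed from its rank and the process count, so a file
written by n = 4 ranks can be read by 1.5*n = 6 ranks with no conversion step.

```python
import os
import struct
import tempfile

def block_range(total, nprocs, rank):
    """Return (start, count) of the contiguous block owned by `rank`.

    Hypothetical helper: splits `total` items as evenly as possible,
    giving the first `total % nprocs` ranks one extra item each.
    """
    base, extra = divmod(total, nprocs)
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return start, count

ITEM = struct.Struct("<d")  # one 8-byte little-endian double per cell
TOTAL = 10                  # made-up global problem size

path = os.path.join(tempfile.mkdtemp(), "solution.dat")

# First run with n = 4 ranks: each rank stores its block at its own offset.
with open(path, "wb") as f:
    f.truncate(TOTAL * ITEM.size)
    for rank in range(4):
        start, count = block_range(TOTAL, 4, rank)
        f.seek(start * ITEM.size)
        for i in range(start, start + count):
            f.write(ITEM.pack(float(i)))

# Later run with 1.5 * n = 6 ranks: same file, different decomposition.
reread = []
with open(path, "rb") as f:
    for rank in range(6):
        start, count = block_range(TOTAL, 6, rank)
        f.seek(start * ITEM.size)
        data = f.read(count * ITEM.size)
        reread.append([v[0] for v in ITEM.iter_unpack(data)])
```

With one file per process, the second loop would instead need to know how
the first run was decomposed and stitch pieces of several files together.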

I don't want to speak for the entire CFD community, but I haven't
seen anyone write out n files. That approach proved to be a huge pain
many years ago.

Other disciplines may have other opinions of course.

>> I would contend that writing to different sections of a file *must* be
>> supported by any file system deployed on a cluster.  How else would
>> you get good performance from MPI-IO?
>
> who uses MPI-IO?  straight question - I don't believe any of our 1500 
> users do.

I do. I also know that some ISVs are moving rapidly to use MPI-IO.

>>> in my experience, people who expect it to "just work" have an
>>> incredibly naive model of how a network FS works (ie, write()
>>> produces an RPC direct to the server)
>>
>> I agree that the POSIX API and consistency semantics make it difficult
>> to achieve high I/O rates for common scientific workloads, and that
>> NFS is probably not the best solution for those truly parallel workloads.
>>
>> Fortunately,  there are good alternatives out there.
>
> starting with the question: "do you have a good reason to be writing 
> in parallel to the same file?".  I'm not saying the answer is never yes.

As Rob mentioned, writing in parallel to the same file gets you good
performance. I think this is a fundamental underpinning of parallel I/O.
You can do this with or without MPI-IO. MPI-IO just makes it easier,
standard, and portable.

Of course you would not have different processes writing to the same region
of a file. But if each process can write to a distinct region or section
of the file without worrying about another process stepping on it, then
why not write in parallel? It's easy to do using MPI-IO. Take a look
at the tutorials on MPI-IO around the web and give them a try.
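
For anyone without MPI at hand, the "distinct regions" idea can be sketched
with plain POSIX positional writes. This is only an illustration under
stated assumptions: threads stand in for MPI ranks, the names write_region,
NRANKS and CHUNK are made up, and it runs against a local temporary file, so
it says nothing about how a particular network file system behaves. With
MPI-IO, each rank would make a call such as MPI_File_write_at with its own
offset instead of os.pwrite.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

NRANKS = 4     # stand-in for the MPI process count (assumption)
CHUNK = 1024   # bytes owned by each rank (made-up size)

def write_region(fd, rank):
    """Write this rank's chunk at its own offset; no shared file pointer."""
    payload = bytes([rank]) * CHUNK
    offset = rank * CHUNK
    written = 0
    # os.pwrite may write fewer bytes than asked for, so loop until done.
    while written < CHUNK:
        written += os.pwrite(fd, payload[written:], offset + written)
    return written

path = os.path.join(tempfile.mkdtemp(), "shared.dat")
fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
try:
    # Each worker writes only to [rank * CHUNK, (rank + 1) * CHUNK).
    with ThreadPoolExecutor(max_workers=NRANKS) as pool:
        totals = list(pool.map(lambda r: write_region(fd, r), range(NRANKS)))
finally:
    os.close(fd)

with open(path, "rb") as f:
    contents = f.read()
```

Because every write carries an explicit offset and the regions do not
overlap, the writers need no coordination beyond agreeing on the layout.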

Jeff