[Beowulf] Torrents for HPC

Thu Jun 14 09:14:27 PDT 2012

On 06/13/2012 11:59 PM, Bill Broadley wrote:
> On 06/13/2012 06:40 AM, Bernd Schubert wrote:
>> What about an easy to setup cluster file system such as FhGFS?
>
> Great suggestion.  I'm all for a generally useful parallel file system
> instead of a torrent solution with a very narrow use case.
>
>> As one of
>> its developers I'm a bit biased of course, but then I'm also familiar
>
> I think this list is exactly the place where a developer should jump in
> and suggest/explain their solutions as they relate to use in HPC clusters.
>
>> with Lustre, and I think FhGFS is far easier to set up. We also do not
>> have a problem running clients and servers on the same node, and some of
>> our customers make heavy use of that and use their compute nodes as
>> storage servers. That should provide the same or better throughput as
>> your torrent system.
>
> I found the wiki, the "view flyer", FAQ, and related.
>
> I had a few questions, I found this link
> http://www.fhgfs.com/wiki/wikka.php?wakka=FAQ#ha_support but was not
> sure of the details.
>
> What happens when a metadata server dies?
>
> What happens when a storage server dies?

Right, those are the two issues we are actively working on at present.
The current release relies on hardware RAID. Later this year there
will be meta data mirroring, and data mirroring will follow after that.

>
> If either of the above means data loss/failure/unreadable files, is there a
> description of how to protect against this with drbd+heartbeat or
> equivalent?

Over the next few weeks we will test fhgfs-ocf scripts for an HA
(pacemaker) installation. As we are going to be paid for the
installation, I do not know yet when we will make those scripts
publicly available. In general, drbd+heartbeat is possible as a
mirroring solution.

>
> Sounds like source is not available, and only binaries for CentOS?

Well, RHEL5 / RHEL6 based, SLES10 / SLES11 and Debian. And sorry, the
server daemons are not open source yet. I think the more people ask
us to open it, the faster this process will be, especially if those
people are also going to buy support contracts :)

>
> Looks like it does need a kernel module, does that mean only old 2.6.X
> CentOS kernels are supported?

Oh, on the contrary. We basically support any kernel from 2.6.16
onwards. Even support for the most recent vanilla kernels is usually
done within a few weeks of their release.

>
> Does it work with mainline ofed on qlogic and mellanox hardware?

It definitely works with both, using RDMA (ibverbs) transfers.
As QLogic has some problems with ibverbs, we cooperated with
QLogic to improve performance on their hardware. Recent QLogic OFED
stacks include the performance fixes.
Please also see
http://www.fhgfs.com/wiki/wikka.php?wakka=NativeInfinibandSupport
for (QLogic) tuning advice.

>
>   From a sysadmin point of view I'm also interested in:
> * Do blocks auto balance across storage nodes?

Actually, files are balanced rather than blocks. The default file stripe
count is 4, but it can be adjusted by the admin. So assuming you have
only one target per server, a large file is distributed over 4 nodes.
The default chunk size is 512kB. For files smaller than that size there
is no striping overhead.
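
To make the striping numbers concrete, here is a minimal sketch of how a
byte offset could map to a storage target. It uses the default stripe
count (4) and chunk size (512kB) mentioned above; the round-robin
placement and the function names are assumptions for illustration only,
not a description of the actual fhgfs implementation.

```python
CHUNK_SIZE = 512 * 1024  # default chunk size from the text (512kB)
STRIPE_COUNT = 4         # default stripe count from the text


def chunk_location(offset, chunk_size=CHUNK_SIZE, stripe_count=STRIPE_COUNT):
    """Return (target_index, chunk_number) for a byte offset.

    Assumes simple round-robin placement of chunks over the targets,
    which is an illustrative assumption.
    """
    chunk_number = offset // chunk_size
    return chunk_number % stripe_count, chunk_number


def targets_used(file_size, chunk_size=CHUNK_SIZE, stripe_count=STRIPE_COUNT):
    """Return how many distinct targets hold data for a file of this size."""
    if file_size <= 0:
        return 0
    chunks = -(-file_size // chunk_size)  # ceiling division
    return min(chunks, stripe_count)
```

Under these assumptions a file of up to one chunk touches a single
target, and any file larger than three chunks is spread over all four.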

> * Is managing disk space, inodes (or equiv) and related capacity
>     planning complex?  Or does df report useful/obvious numbers?

Hmm, right now (unix) "df -i" does not yet report the inode usage for
fhgfs. We will fix that in later releases.
At least for traditional storage servers we recommend using ext4 on
meta-data partitions for performance reasons. For storage partitions we
usually recommend XFS, again for performance.
Also, storage and meta-data can be on the very same partition; you just
need to configure the path where that data is to be found in the
corresponding config files.
If you are going to use all your client nodes as fhgfs servers and those
already have XFS as scratch partition, XFS is probably also fine. However,
due to a severe XFS performance issue, you either need a kernel that has
this issue fixed or you should disable meta-data-as-xattr
(in fhgfs-meta.conf: storeUseExtendedAttribs = false).
Also please see here for a discussion and benchmarks
http://oss.sgi.com/archives/xfs/2011-08/msg00233.html

Christoph Hellwig later fixed the unlink issue, and this patch
should be in all recent linux-stable kernels. I have not checked
RHEL5/RHEL6, though.

Anyway, if you are going to use ext4 on your meta-data partition, you
need to make sure yourself that you have sufficient inodes available.
Our wiki has recommendations for mkfs.ext4 options.
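
As a rough way to keep an eye on inodes on the underlying ext4
partition, something like the following sketch could be used. It only
relies on the standard statvfs call; the function names, the headroom
factor and the idea of one inode per expected new file are assumptions
for illustration, since the real inode consumption per fhgfs file is
not stated here.

```python
import os


def inode_usage(path):
    """Return (total, free, used_fraction) inode figures for the
    file system that holds path, as reported by statvfs."""
    st = os.statvfs(path)
    total, free = st.f_files, st.f_ffree
    used_fraction = (total - free) / total if total else 0.0
    return total, free, used_fraction


def inodes_sufficient(path, expected_new_files, headroom=1.2):
    """Check whether free inodes cover the expected number of new files.

    Assumes roughly one inode per new file plus some headroom; both
    numbers are hypothetical and should be adapted.
    """
    _, free, _ = inode_usage(path)
    return free >= expected_new_files * headroom
```

It would be pointed at the mount point of the meta-data partition, for
example inodes_sufficient("/path/to/meta", 10_000_000), where the path
is a placeholder.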

> * Can storage nodes be added/removed easily by migrating on/off of
>     hardware?

Adding storage nodes on the fly works perfectly fine. Our fhgfs-ctl tool
also has a mode to migrate files off a storage node. However, we strongly
recommend not doing that while clients are writing to the file system.
The reason is that we do not lock files-in-migration yet, so a
client might then write to unlinked files, which would result in silent
data loss. We have on-the-fly data migration on our todo list, but I
cannot say yet when that is going to come.
If you are going to use your clients as storage nodes, you could specify
that system as the preferred system to write files to. That would make it
easy to remove that node...

> * Does FhGFS handle 100% of the distributed file system responsibilities
>     or does it layer on top of xfs/ext4 or related?  (like ceph)

Like ceph, it layers on top of other file systems, such as xfs or ext4.

> * With large files does performance scale reasonably with storage
>     servers?

Yes, and you may also adjust the stripe count to your needs. The default
stripe count is 4, which provides approximately the performance of 4
storage targets.

> * With small files does performance scale reasonably with metadata
>     servers?

Striping over different meta data servers is done on a per-directory
basis. As most users and applications work in different directories,
meta data performance usually scales linearly with the number of
metadata servers.
Please note: Our wiki has tuning advice for meta data performance, and
with our next major release we should also see greatly improved meta
data performance.
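
The effect of per-directory distribution can be illustrated with a
small sketch. How a directory is actually assigned to a metadata server
is not described here, so the CRC32-based choice and the function name
below are purely hypothetical; the point is only that all entries of
one directory land on the same server, while different directories can
spread over all servers.

```python
import posixpath
import zlib


def metadata_server_for(path, num_servers):
    """Pick a metadata server index for a file based on its directory.

    The CRC32 hash is an illustrative stand-in for whatever assignment
    policy the file system really uses.
    """
    directory = posixpath.dirname(path)
    return zlib.crc32(directory.encode("utf-8")) % num_servers
```

With this model, many users working in one shared directory would all
hit the same metadata server, which is why the scaling claim above
depends on work being spread over different directories.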

Hope it helps and please let me know if you have further questions!

Cheers,
Bernd

PS: We have a GUI, which should help you to just try it out within a few 
minutes. Please see here:
http://www.fhgfs.com/wiki/wikka.php?wakka=GUIbasedInstallation