[Beowulf] Torrents for HPC
bill at cse.ucdavis.edu
Wed Jun 13 14:59:16 PDT 2012
On 06/13/2012 06:40 AM, Bernd Schubert wrote:
> What about an easy to setup cluster file system such as FhGFS?
Great suggestion. I'm all for a generally useful parallel file system
instead of a torrent solution with a very narrow use case.
> As one of
> its developers I'm a bit biased of course, but then I'm also familiar
I think this list is exactly the place where a developer should jump in
and suggest/explain their solutions as they relate to use in HPC clusters.
> with Lustre, and I think FhGFS is far easier to set up. We also do not
> have a problem running clients and servers on the same node, and some of
> our customers make heavy use of that and use their compute nodes as
> storage servers. That should provide the same or better throughput as
> your torrent system.
I found the wiki, the "view flyer", the FAQ, and related pages.
I had a few questions. I found this link
http://www.fhgfs.com/wiki/wikka.php?wakka=FAQ#ha_support but was not
sure of the details.
What happens when a metadata server dies?
What happens when a storage server dies?
If either of the above results in data loss, failures, or unreadable files, is there a
description of how to protect against this with drbd+heartbeat or
Sounds like the source is not available, and there are only binaries for CentOS?
Looks like it does need a kernel module. Does that mean only old 2.6.X
CentOS kernels are supported?
Does it work with mainline ofed on qlogic and mellanox hardware?
From a sysadmin point of view I'm also interested in:
* Do blocks auto balance across storage nodes?
* Is managing disk space, inodes (or equiv) and related capacity
planning complex? Or does df report useful/obvious numbers?
* Can storage nodes be added/removed easily by migrating on/off of
* Does FhGFS handle 100% of the distributed file system responsibilities
or does it layer on top of xfs/ext4 or related? (like ceph)
* With large files does performance scale reasonably with storage
* With small files does performance scale reasonably with metadata
BTW, if anyone is current on any other parallel file system, I (and I
suspect others on the list) would find their input very valuable. I run a hadoop
cluster, but I suspect there are others on the list who could provide
better answers than I.
My lustre knowledge is second hand and dated.