[Beowulf] integrating node disks into a cluster filesystem?

Fri Sep 25 15:59:49 PDT 2009

> users to cache data-in-progress to scratch space on the nodes.  But there's a 
> definite draw to a single global scratch space that scales automatically with 
> the cluster itself.

using node-local storage is fine, but really an orthogonal issue.
if people are willing to do it, it's great and scales nicely.
but it doesn't really address the question of how to make use of the
3-8 TB of disk in each node.  we suggest that people use node-local /tmp, 
and we like that name because it emphasizes the temporary nature of the space.
currently we don't sweat the cleanup of /tmp (in fact we merely 
have the distro-default 10-day tmpwatch).
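
for anyone who wants something more tailored than the stock tmpwatch job, 
here is a minimal sketch of an age-based sweep.  it is not how tmpwatch 
itself is implemented; the function names (stale_files, clean), the use of 
access time as the age criterion, and the dry-run default are all my own 
assumptions for illustration.  the 10-day default just mirrors the figure 
above.

```python
import os
import stat
import time


def stale_files(root, max_age_days=10, now=None):
    """Yield regular files under root not accessed in the last max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: never follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode) and st.st_atime < cutoff:
                yield path


def clean(root, max_age_days=10, dry_run=True, now=None):
    """Return the stale files found; delete them unless dry_run is set."""
    removed = []
    for path in stale_files(root, max_age_days, now):
        if not dry_run:
            try:
                os.remove(path)
            except OSError:
                continue  # could not remove; leave it out of the report
        removed.append(path)
    return removed
```

calling clean("/tmp") only reports what would go; pass dry_run=False to 
actually delete.  note that filesystems mounted noatime will not update 
access times, so the criterion would need to change there.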

>> - obviously want to minimize the interference of remote IO to a node's 
>> jobs.
>>  for serial jobs, this is almost moot.  for loosely-coupled parallel jobs
>>  (whether threaded or cross-node), this is probably non-critical.  even for
>>  tightly-coupled jobs, perhaps it would be enough to reserve a core for
>>  admin/filesystem overhead.
>
> I'd also strongly consider a separate network for filesystem I/O.

why?  I'd like to see some solid numbers on how often jobs are really
bottlenecked on the interconnect (assuming something reasonable like DDR IB).
I can certainly imagine it could be so, but how often does it happen?
is it only for specific kinds of designs (all-to-all users, say)?

>> - distributed filesystem (ceph?  gluster?  please post any experience!)  I
>>  know it's possible to run oss+ost services on a lustre client, but not
>>  recommended because of the deadlock issue.
>
> I played with PVFS1 a bit back in the day.  My impression at the time was

yeah, I played with it too, but forgot to mention it because it is afaik
still dependent on all nodes being up.  admittedly, most of the alternatives
also assume all servers are up...