[Beowulf] dedupe filesystem

Bogdan Costescu Bogdan.Costescu at iwr.uni-heidelberg.de
Wed Jun 3 20:33:32 PDT 2009


On Wed, 3 Jun 2009, John Hearns wrote:

> Quite often the architecture of storage is a secondary consideration,
> in the rush to get a Shiny New Fast machine on site and working.

Well, I've seen it ignored even outside of that rush, in the design
phase. I confess to being guilty of this as well, but I learn from
my mistakes :-)

> In HPC, there are a lot of advantages in a central clustered 
> filesystem, where you can prepare your input data, run the 
> simulation, then at the end visualize the data.

In theory this sounds nice, but in practice it can prove a bit more
difficult, and most of the time the human factor is the main culprit
(just as in the scenarios I presented earlier). Administrative issues
(who owns what) can seriously limit the possibility of coupling the
HPC and visualisation resources, which often leads to duplication of
data. Stupid sysadmins or policies can leave the HPC resource with
only very basic text editors or terminal settings, so users create
the input set on their own workstations and constantly copy it over.
Only the actual running of the simulation can be tightly linked to
the clustered file system...

> The systems administrator must put in place strong policies on this -
> leave your data on the fast storage, it gets deleted after N weeks.

I see duplication of data in almost all cases as a human behaviour
problem, not a technical one. It needs human behaviour solutions
rather than technical ones, so policies are a good solution. But I
would argue that user education is an even better one: teach users
why copying data is bad and give them easy ways of safely sharing
data with their collaborators. Not only will they keep the file
systems emptier, they will also thank you for the reduced effort
needed to manage the ever-increasing amounts of data. Such a
solution is, however, only feasible for smaller groups. An HPC
center offering services to several (or many) universities, for
example, won't be able to convince all its users to take the time to
think about data management, and a virtual sucker rod is not as
efficient as a real one, so policies would still be required...
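
For illustration only, the "deleted after N weeks" policy quoted above
could start life as a report rather than a deletion. The sketch below
is a minimal example under stated assumptions: the function name
find_stale_files is hypothetical, age is judged by modification time
only, and nothing is removed. A real site would likely also consider
access time, quotas, and exemptions.

```python
import os
import time


def find_stale_files(root, max_age_weeks, now=None):
    """Return sorted paths under root not modified for max_age_weeks.

    Hypothetical helper: it only reports candidates and deletes nothing.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_weeks * 7 * 24 * 3600  # weeks -> seconds
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are judged on their own age
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                # file vanished or is unreadable; skip it
                continue
    return sorted(stale)
```

Calling find_stale_files("/scratch", 4), where "/scratch" is an assumed
mount point, would list the files a 4-week policy would target, which
can be mailed to their owners before anything is actually purged.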

-- 
Bogdan Costescu

IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8240, Fax: +49 6221 54 8850
E-mail: bogdan.costescu at iwr.uni-heidelberg.de


