[Beowulf] filesystem metadata mining tools

Prentice Bisbal prentice at ias.edu
Tue Sep 15 11:26:56 PDT 2009


I have used Perl in the past to gather summaries of file usage like
this. The details are fuzzy (it was a couple of years ago), but I think I
did a 'find -ls' to a text file and then used Perl to parse the file
and add up the various statistics. I wasn't gathering as many statistics
as you, but it was pretty easy to write, even for a novice Perl programmer
like me.
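
To make the 'find -ls' approach concrete, here is a minimal sketch of the
parsing step. It is written in Python rather than Perl and is not the
original script. It assumes the GNU find '-ls' column layout (inode, blocks,
permissions, links, user, group, size, three date fields, name); the
function name summarize and the file name files.txt are hypothetical.

```python
import sys
from collections import Counter, defaultdict


def summarize(lines):
    """Tally file count and total bytes per user from `find -ls` output lines."""
    counts = Counter()
    sizes = defaultdict(int)
    for line in lines:
        # Assumed layout: inode blocks perms links user group size mon day time name
        # maxsplit=10 keeps a file name containing spaces in one field.
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[2].startswith("-"):
            continue  # skip malformed lines and anything that is not a regular file
        try:
            size = int(fields[6])
        except ValueError:
            continue  # size column was not a plain integer
        user = fields[4]
        counts[user] += 1
        sizes[user] += size
    return counts, sizes


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read a listing saved earlier, e.g. with: find /some/path -ls > files.txt
    with open(sys.argv[1], errors="replace") as fh:
        counts, sizes = summarize(fh)
    for user, n in counts.most_common():
        print(f"{user:<12} {n:>10} files {sizes[user]:>15} bytes")
```

Saving the listing once and parsing it repeatedly avoids walking the
filesystem again for each new statistic; further tallies (by extension,
by age, by size bucket) could be added to the same loop.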

Prentice

Rahul Nabar wrote:
> As the number of total files on our server was exploding (~2.5 million
> / 1 Terabyte) I wrote a simple shell script that used find to tell me
> which users have how many. So far so good.
> 
> But I want to drill down more:
> 
> *Are there lots of duplicate files? I suspect so. Stuff like job submission
> scripts which users copy rather than link etc. (fdupes seems puny for
> a job of this scale)
> 
> *What is the most common file (or filename)?
> 
> *A distribution of filetypes (executables; netcdf; movies; text) and
> prevalence.
> 
> *A distribution of file age and prevalence (to know how much of this
> material is archivable). Same for frequency of access; i.e. maybe the last
> access stamp.
> 
> * A file size versus number plot. i.e. Is 20% of space occupied by 80% of
> files? etc.
> 
> I've used cushion plots in the past (SequoiaView; pydirstat) but those
> seem more desktop oriented than suitable for a job like this.
> 
> Essentially I want to data mine my file usage to strategize. Are there any
> tools for this? Writing a new find each time seems laborious.
> 
> I suspect forensics might also help identify anomalies in usage across
> users which might be indicative of other maladies. e.g. a user who had a
> runaway job write a 500GB file etc.
> 
> Essentially are there any "filesystem metadata mining tools"?
> 


