[Beowulf] filesystem metadata mining tools


Lux, Jim (337C) james.p.lux at jpl.nasa.gov
Sat Sep 12 16:02:10 PDT 2009




On 9/12/09 8:10 AM, "Rahul Nabar" <rpnabar at gmail.com> wrote:

> As the number of total files on our server was exploding (~2.5 million
> / 1 Terabyte) I wrote a simple shell script that used find to tell me
> which users have how many. So far so good.
> 
> But I want to drill down more:
> 
> *Are there lots of duplicate files? I suspect so. Stuff like job submission
> scripts which users copy rather than link etc. (fdupes seems puny for
> a job of this scale)
> 
> *What is the most common file (or filename)?
> 
> *A distribution of filetypes (executables; netcdf; movies; text) and
> prevalence.
> 
> *A distribution of file age and prevalence (to know how much of this
> material is archivable). Same for frequency of access; i.e. maybe the last
> access stamp.
> 
> * A file size versus number plot. i.e. Is 20% of space occupied by 80% of
> files? etc.
> 
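
The questions above could be approached with a single walk over the tree that records per-file metadata, followed by a few small summaries. What follows is only a minimal sketch under stated assumptions, not a tool anyone in the thread described: the function names (scan, find_duplicates, summarize, space_share_of_largest) are hypothetical, file type is approximated by filename extension rather than by inspecting contents, age is bucketed by whole years since last modification, and duplicates mean byte-identical files. To keep hashing cheap at the scale of millions of files, only files that share a size with at least one other file are hashed.

```python
import hashlib
import os
import stat
import time
from collections import Counter, defaultdict


def scan(root):
    """Yield (path, name, size, mtime, atime) for each regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(st.st_mode):
                yield path, name, st.st_size, st.st_mtime, st.st_atime


def file_digest(path, chunk=1 << 20):
    """Return the SHA-1 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(records):
    """Return lists of paths whose contents are byte-identical."""
    by_size = defaultdict(list)
    for path, _name, size, _mtime, _atime in records:
        by_size[size].append(path)
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            try:
                by_hash[(size, file_digest(path))].append(path)
            except OSError:
                continue
    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]


def summarize(records, now=None):
    """Count filenames, extensions, and age in whole years since mtime."""
    now = time.time() if now is None else now
    names, exts, age_years = Counter(), Counter(), Counter()
    for _path, name, _size, mtime, _atime in records:
        names[name] += 1
        exts[os.path.splitext(name)[1].lower() or "(none)"] += 1
        age_years[int(max(0, now - mtime) // (365 * 86400))] += 1
    return names, exts, age_years


def space_share_of_largest(sizes, top_fraction=0.2):
    """Fraction of total bytes held by the largest top_fraction of files."""
    sizes = sorted(sizes, reverse=True)
    total = sum(sizes)
    if total == 0:
        return 0.0
    k = max(1, int(len(sizes) * top_fraction))
    return sum(sizes[:k]) / total
```

Typical use would be records = list(scan("/some/root")), then names.most_common(10) for the most common filenames, and space_share_of_largest([r[2] for r in records]) for the size-versus-count question. Swapping r[3] (mtime) for r[4] (atime) in summarize would give the last-access distribution, provided the filesystem is not mounted with access-time updates disabled.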

Another useful application for such a tool would be to get better KLOC
counts of source code trees.  I find that our trees have lots of duplication
among branches (e.g., everyone has a "test.c" for unit tests in with their
modules, and all of them are pretty similar).
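
As a rough illustration of the KLOC idea, a line count can be taken twice over a source tree: once over every file, and once counting each distinct file content only a single time. This is a sketch under assumptions, not an existing tool: the function name count_lines and the extension set are hypothetical, and it only collapses byte-identical copies, so files that are merely "pretty similar" would still be counted separately.

```python
import hashlib
import os

SOURCE_EXTS = {".c", ".h"}  # assumption: which extensions count as source


def count_lines(root, exts=SOURCE_EXTS):
    """Return (raw_lines, unique_lines) for source files under root.

    raw_lines counts every file; unique_lines counts each distinct
    file content once, so identical copies across branches collapse.
    """
    seen = set()
    raw = unique = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() not in exts:
                continue
            try:
                with open(os.path.join(dirpath, name), "rb") as f:
                    data = f.read()
            except OSError:
                continue  # unreadable file; skip it
            # Count newline-terminated lines, plus a final unterminated one.
            lines = data.count(b"\n")
            if data and not data.endswith(b"\n"):
                lines += 1
            raw += lines
            digest = hashlib.sha1(data).digest()
            if digest not in seen:
                seen.add(digest)
                unique += lines
    return raw, unique
```

The gap between the two numbers gives a lower bound on how much of the raw count is copy-and-paste duplication.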




