[Beowulf] Small files
hearnsj at googlemail.com
Thu Jun 12 04:22:13 PDT 2014
Tom, as Reuti says, let's have a look at the nature of these files.
What are they, and are analysis jobs really revisiting them again and again?
agedu is a marvellous tool for analysing filesystem usage.
I have used it a lot in the past on the scratch storage of our clusters to
highlight data which hadn't been used in ages.
I'm not sure how long agedu will take to index a large Lustre filesystem
like yours, but it would be well worth having a try.
agedu doesn't work on DMF filesystems (as it uses a stat on the file, and
migrated files would appear to be very small).
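
To make the idea concrete, here is a rough sketch of that kind of scan. It
is not agedu itself, only a stand-in that walks a tree and totals file sizes
by time since last access. The function name usage_by_age and the bucket
boundaries are my own assumptions. It relies on stat in the same way, so the
DMF caveat above would presumably apply to it too, and it assumes the
filesystem actually records access times.

```python
import os
import time


def usage_by_age(root, bucket_days=(30, 90, 365), now=None):
    """Sum file sizes under root, grouped by days since last access.

    Hypothetical helper for illustration; bucket_days are arbitrary.
    """
    now = time.time() if now is None else now
    labels = [f"<= {d}d" for d in bucket_days] + [f"> {bucket_days[-1]}d"]
    totals = {label: 0 for label in labels}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks
            except OSError:
                continue  # file vanished or is unreadable; skip it
            age_days = (now - st.st_atime) / 86400
            for limit, label in zip(bucket_days, labels):
                if age_days <= limit:
                    totals[label] += st.st_size
                    break
            else:
                # Older than every boundary: goes in the last bucket.
                totals[labels[-1]] += st.st_size
    return totals


if __name__ == "__main__":
    for label, size in usage_by_age(".").items():
        print(f"{label:>8}  {size} bytes")
```

On a tree with tens of millions of files a single-threaded walk like this
will be slow, so it is best tried on one project directory first.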
On 12 June 2014 12:12, John Hearns <hearnsj at googlemail.com> wrote:
> I agree with you regarding small files.
> In my case, I manage a DMF (SGI Data Migration Facility) setup.
> I was concerned at the number of small files which we were storing - in
> terms of the size of the database files, and storing small files to tape.
> SGI engineers reassured me that the system will happily cope with millions
> of files, and does so on many sites.
> DMF also waits till a large 'chunk' is to be written to tape, i.e. small
> writes are queued up.
> However, when watching the number of files being pushed to the tape tier
> one day I noticed something like 10 000 files or more from one user.
> Cue the application of a LART.
> Seriously though - I did have a word and he agreed to zip up all the small
> PNG files his project was generating.
> I have a general policy here that when lots of small files are generated
> then the directory is zipped up and the zip file is stored.
> We have codes which generate lots of zip files which are stitched together
> into movies, and we also store wind tunnel data which is again
> lots of PNG files. It is unlikely that anyone would ever want the raw data
> files again, but if they should do then an unzip is easy.
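
The zip-up policy described above can be sketched roughly as follows. This
is only an illustration: zip_directory is a hypothetical helper name, not
anything the site actually runs, and it assumes the archive is written
somewhere outside the directory being packed.

```python
import os
import zipfile


def zip_directory(src_dir, zip_path):
    """Pack every file under src_dir into one zip archive.

    Hypothetical helper for illustration. zip_path is assumed to lie
    outside src_dir, otherwise the walk could pick up the archive itself.
    Returns the number of files added.
    """
    count = 0
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(src_dir):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                # Store paths relative to src_dir so unzip recreates the tree.
                arcname = os.path.relpath(full, start=src_dir)
                zf.write(full, arcname)
                count += 1
    return count
```

PNG data is already compressed, so deflate gains little here. The benefit is
that the tape tier sees one large file instead of thousands of small ones.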
> > Do you distinguish and segregate them (and/or the people that use them)
> > on special hardware/filesystems?
> Suggest you invest in a LART. http://dictionary.reference.com/browse/lart
> On 12 June 2014 11:43, Reuti <reuti at staff.uni-marburg.de> wrote:
>> Am 11.06.2014 um 21:03 schrieb Tom Harvill:
>> > This is my first time posting to this list, thanks in advance for any
>> > time you spend replying.
>> > We've found that a large majority of our files (~40MM of ~50MM) are
>> > less than 10KB.
>> > We believe our filesystem (lustre) is bottlenecked with IOPs and
>> > locking related to jobs running against these files. We have ~700TB
>> > usable storage with ~500TB consumed, almost all consumption is by a
>> > relatively small number of very very large files.
>> What data is represented in 10KB: binary or ASCII data - would it work to
>> put it in a database instead of all these single files? How do you access
>> the files: by some kind of index, name, directory...?
>> -- Reuti
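
As an illustration of the database suggestion above, here is a minimal
sketch that keeps small files as rows in a single SQLite file, keyed by
name. SQLite is my assumption; the thread does not name a database. The
table name small_files and the helper names are hypothetical. Whether
this suits many concurrent jobs on a parallel filesystem is a separate
question that would need testing.

```python
import sqlite3


def open_store(db_path):
    """Open (or create) a single-file store for small blobs."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS small_files ("
        "name TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def put(conn, name, data):
    """Insert or overwrite one entry; commits on success."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO small_files (name, data) VALUES (?, ?)",
            (name, data),
        )


def get(conn, name):
    """Return the stored bytes for name, or None if absent."""
    row = conn.execute(
        "SELECT data FROM small_files WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]
```

The primary key on name plays the role of the directory lookup, which is
why the question of how the files are accessed matters.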
>> > I want to ask this general question: how does your shop deal with the
>> > general problem of small files in filesystems on (beowulf) compute
>> > clusters? Specifically, files that users expect to actively use for
>> > read and write operations for their research.
>> > Do you distinguish and segregate them (and/or the people that use them)
>> > on special hardware/filesystems?
>> > Thanks!
>> > Tom
>> > Tom Harvill
>> > Holland Computing Center
>> > University of Nebraska
>> > _______________________________________________
>> > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin
>> > To change your subscription (digest mode or unsubscribe) visit