<div dir="ltr">Tom, as Reuti says, let's have a look at the nature of these files.<div>What are they, and are analysis jobs really revisiting them again and again?</div><div><br></div><div>This is a marvellous tool for analysing filesystem usage:</div>


<div><a href="http://www.chiark.greenend.org.uk/~sgtatham/agedu/">http://www.chiark.greenend.org.uk/~sgtatham/agedu/</a><br></div><div><br></div><div>I have used it a lot in the past on the scratch storage of our clusters to highlight data which hadn't been used in ages.</div>


<div><br></div><div>I'm not sure how long agedu will take to index a large Lustre filesystem like yours, but it would be well worth a try.</div><div><br></div><div>Agedu doesn't work on DMF filesystems (it uses a stat on the file, and migrated files would appear to be very small).</div>
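<div><br></div><div>To give a rough idea of the kind of question worth asking of the filesystem, here is a minimal Python sketch. It is not how agedu works internally; it is just a plain tree walk that uses lstat to count small files and files not accessed for a while. The function name, the 10 KB threshold and the 180 day cutoff are my own assumptions for illustration, and on a filesystem with tens of millions of files a walk like this may well be slow.</div>

```python
import os
import stat
import time


def summarize_tree(root, small_bytes=10 * 1024, stale_days=180, now=None):
    """Count small and long-unaccessed regular files under root via lstat().

    small_bytes and stale_days are illustrative defaults, not recommendations.
    """
    now = time.time() if now is None else now
    stale_cutoff = now - stale_days * 86400
    summary = {"files": 0, "small": 0, "stale": 0, "stale_bytes": 0}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(st.st_mode):
                continue  # ignore symlinks, sockets, devices
            summary["files"] += 1
            if st.st_size < small_bytes:
                summary["small"] += 1
            if st.st_atime < stale_cutoff:
                summary["stale"] += 1
                summary["stale_bytes"] += st.st_size
    return summary
```

<div>Calling summarize_tree on a project directory returns a dict of counts. The access times are only meaningful if the filesystem is not mounted in a way that suppresses atime updates.</div>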


<div><br></div><div><br></div><div><br></div></div><div class="gmail_extra"><br><br><div class="gmail_quote">On 12 June 2014 12:12, John Hearns <span dir="ltr"><<a href="mailto:hearnsj@googlemail.com" target="_blank">hearnsj@googlemail.com</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Tom, <div>I agree with you regarding small files.</div><div>In my case, I manage a DMF (SGI Data Migration Facility) setup.</div>


<div>I was concerned at the number of small files which we were storing - in terms of the size of the database files, and of storing small files to tape.</div>

<div>SGI engineers reassured me that the system will happily cope with millions of files, and does so on many sites.</div><div>DMF also waits until a large 'chunk' is ready to be written to tape, i.e. small writes are queued up.</div>


<div><br></div><div>However, when watching the number of files being pushed to the tape tier one day, I noticed something like 10 000 files or more from one user.</div><div>Cue the application of a LART. </div><div>Seriously though - I did have a word, and he agreed to zip up all the small PNG files his project was generating.</div>


<div><br></div><div>I have a general policy here that when lots of small files are generated, the directory is zipped up and the zip file is stored.</div><div>We have codes which generate lots of zip files which are stitched together into movies, and we also store wind tunnel data which is again</div>


<div>lots of PNG files. It is unlikely that anyone would ever want the raw data files again, but if they should do then an unzip is easy.</div><div class=""><div><br></div><div><br></div><div><span style="font-family:arial,sans-serif;font-size:13px">> Do you distinguish and segregate them (and/or the people that use them) on special</span><br style="font-family:arial,sans-serif;font-size:13px">


<span style="font-family:arial,sans-serif;font-size:13px">> hardware/filesystems?</span><br></div></div><div>Suggest you invest in a LART.  <a href="http://dictionary.reference.com/browse/lart" target="_blank">http://dictionary.reference.com/browse/lart</a></div>
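<div><br></div><div>For what it's worth, the zip-up policy mentioned above can be sketched in a few lines of Python using the standard zipfile module. This is only an illustration and not the script we actually run; the function names are made up, and it assumes the zip file is written outside the directory being packed.</div>

```python
import os
import zipfile


def zip_directory(src_dir, zip_path):
    """Pack every regular file under src_dir into one zip; return the member count.

    zip_path is assumed to live outside src_dir, otherwise the archive
    would try to include itself.
    """
    count = 0
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for dirpath, _dirnames, filenames in os.walk(src_dir):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                if os.path.islink(full) or not os.path.isfile(full):
                    continue  # store regular files only
                # Store paths relative to src_dir so unzip recreates the tree
                zf.write(full, arcname=os.path.relpath(full, src_dir))
                count += 1
    return count


def verify_zip(zip_path):
    """Return True if every member of the archive passes its CRC check."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.testzip() is None
```

<div>It would be sensible to check verify_zip on the result before removing the original small files.</div>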


<div><br></div><div><br></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><br><div class="gmail_quote">On 12 June 2014 11:43, Reuti <span dir="ltr"><<a href="mailto:reuti@staff.uni-marburg.de" target="_blank">reuti@staff.uni-marburg.de</a>></span> wrote:<br>


<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Hi,<br>

<br>

On 11.06.2014 at 21:03, Tom Harvill wrote:<br>

<div><br>

> This is my first time posting to this list, thanks in advance for any time you spend<br>

> replying.<br>

><br>

> We've found that a large majority of our files (~40MM of ~50MM) are less than 10KB.<br>

> We believe our filesystem (lustre) is bottlenecked with IOPs and locking related to<br>

> jobs running against these files.  We have ~700TB usable storage with ~500TB consumed,<br>

> almost all consumption is by a relatively small number of very very large files.<br>

<br>

</div>What data is represented in 10KB: binary or ASCII data - would it work to put it in a database instead of all these single files? How do you access the files: by some kind of index, name, directory...?<br>

<span><font color="#888888"><br>

-- Reuti<br>

</font></span><div><div><br>

<br>

> I want to ask this general question: how does your shop deal with the general problem of<br>

> small files in filesystems on (beowulf) compute clusters? Specifically, files that users expect<br>

> to actively use for read and write operations for their research.<br>

><br>

> Do you distinguish and segregate them (and/or the people that use them) on special<br>

> hardware/filesystems?<br>

><br>

> Thanks!<br>

> Tom<br>

><br>

> Tom Harvill<br>

> Holland Computing Center<br>

> University of Nebraska<br>

> _______________________________________________<br>

> Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>

> To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>

<br>


</div></div></blockquote></div><br></div>

</div></div></blockquote></div><br></div>