[Beowulf] help for metadata-intensive jobs (imagenet)

John Hearns hearnsj at googlemail.com
Fri Jun 28 23:48:21 PDT 2019


Igor, if there are any papers published on what you are doing with these
images, I would be very interested.
I went to the new London HPC and AI Meetup on Thursday; one talk, by
Odin Vision, was excellent.
I recommend the new Meetup to anyone in the area. Next meeting: 21st August.

And a plug to Verne Global - they provided free Icelandic beer.

On Sat, 29 Jun 2019 at 05:43, INKozin via Beowulf <beowulf at beowulf.org>
wrote:

> Converting the files to TFRecords or similar would be one obvious
> approach if you are concerned about metadata. But then I'd understand why
> some people would not want that (size, augmentation process). I assume you
> are doing the training in a distributed fashion using MPI via Horovod
> or similar, and it might be tempting to do file partitioning across the
> nodes. However, doing so introduces a bias into minibatches (and custom
> preprocessing). If you partition carefully by mapping classes to nodes it
> may work, but I also understand why some wouldn't be totally happy with
> that. I've trained Keras/TF/Horovod models on ImageNet using up to 6 nodes,
> each with four P100/V100 GPUs, and it worked reasonably well. As the training
> still took a few days, copying to local NVMe disks was a good option.
> HTH
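[The class-to-node partitioning mentioned above can be sketched roughly as
below. This is a hypothetical illustration only: `partition_by_class` and the
sample layout are invented names, and a real pipeline would more likely shard
with `tf.data` keyed on the Horovod rank rather than precompute lists.]

```python
from collections import defaultdict

def partition_by_class(samples, num_nodes):
    """Assign whole classes to nodes round-robin (hypothetical sketch).

    samples: iterable of (path, label) pairs.
    Returns one shard (list of pairs) per node. Because each node sees
    only its own classes, its minibatches are biased toward that subset,
    which is exactly the caveat raised in the message above.
    """
    by_class = defaultdict(list)
    for path, label in samples:
        by_class[label].append(path)
    shards = [[] for _ in range(num_nodes)]
    # Deal classes out round-robin so every node still gets a spread of
    # classes rather than one contiguous block of the label space.
    for i, (label, paths) in enumerate(sorted(by_class.items())):
        shards[i % num_nodes].extend((p, label) for p in paths)
    return shards
```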
>
> On Fri, 28 Jun 2019, 18:47 Mark Hahn, <hahn at mcmaster.ca> wrote:
>
>> Hi all,
>> I wonder if anyone has comments on ways to avoid metadata bottlenecks
>> for certain kinds of small-IO-intensive jobs.  For instance, ML on ImageNet,
>> which seems to be a massive collection of trivially sized files.
>>
>> A good answer is "beef up your MD server, since it helps everyone".
>> That's a bit naive, though (no money-trees here.)
>>
>> How about things like putting the dataset into squashfs or some other
>> image that can be loop-mounted on demand?  sqlite?  perhaps even a format
>> that can simply be mmapped as a whole?
>>
>> Personally, I tend to dislike the approach of having a job stage tons of
>> stuff onto node storage (when it exists), simply because that guarantees a
>> waste of CPU/GPU/memory resources for however long the stage-in takes...
>>
>> thanks, mark hahn.
>> --
>> operator may differ from spokesperson.              hahn at mcmaster.ca
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>>