[Beowulf] HPC workflows

Sun Dec 9 12:43:31 PST 2018

> but for now just expecting to get something good without an effort is
> probably premature.

Nothing good ever came easy.

Who said that? My Mum. And she was a very wise woman.

On Sun, 9 Dec 2018 at 21:36, INKozin via Beowulf <beowulf at beowulf.org>
wrote:

> While I agree with many points made so far I want to add that one aspect
> which used to separate a typical HPC setup from some IT infrastructure is
> complexity. And I don't mean technological complexity (because
> technologically HPC can be fairly complex) but the diversity and the
> interrelationships between various things. Typically HPC is relatively
> homogeneous and straightforward. But everything is changing, including HPC,
> so modularisation is a natural approach to making systems more manageable.
> Containers, conda, Kubernetes etc. are solutions to fight complexity. Yes,
> these solutions can be fairly complex too but the impact is generally
> intentionally restricted. For example, a conda environment can be rather
> bloated but then trading size for flexibility is reasonable.
> One of the points Werner Vogels, Amazon CTO, kept coming back to over and
> over again in his keynote at the recent re:Invent is modular (cellular)
> architecture at different levels (lambdas, firecracker, containers, VMs and
> up) because working with redundant, replaceable modules makes services
> scalable and resilient.
> And I'm pretty sure the industry will continue on its path to embrace
> microVMs as it did containers before that.
> This modular approach may work quite well for on prem IT, cloud or HTC
> (High Throughput Computing) but may still be a challenge for HPC because
> you can argue that a true HPC system must be tightly coupled (e.g. remember
> OS jitter?).
> As for ML and more specifically deep learning, it depends on what you do.
> If you are doing inferencing, i.e. a production setup, i.e. more like HTC,
> then everything works fine. But if you want to train a model on ImageNet or
> larger and do it very quickly (hours) then you will benefit from a tightly
> coupled setup (although there are tricks such as asynchronous parameter
> updates to alleviate latency).
> Two cases in point here: Kubeflow, whose scaling seems somewhat deficient,
> and the Horovod library, which made many people rather excited because it
> allows using TensorFlow with MPI.
> While Docker and Singularity can be used with MPI, you'd probably want to
> trim as much as you can if you want to push the scaling limit. But I think
> we've already discussed many times on this list the topic of "heroic" HPC
> vs "democratic" HPC (top vs tail).
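
A minimal sketch of the synchronous versus asynchronous parameter update idea mentioned above. It is not taken from the thread and does not use Horovod, TensorFlow or MPI; it is a toy, single-process illustration in plain Python. The quadratic loss, the `staleness` parameter and the function names are all assumptions made for the example. In the synchronous version every worker's gradient is averaged before each update (the role an allreduce plays in a tightly coupled setup); in the asynchronous version each worker applies its gradient on its own, computed from a copy of the parameter that may be a few updates old.

```python
def grad(w, target):
    # Gradient of the toy loss 0.5 * (w - target)**2 for one worker's data shard.
    return w - target


def sync_sgd(targets, lr=0.1, steps=200):
    # Synchronous: all workers see the same w, gradients are averaged,
    # then a single update is applied. Every step waits for every worker.
    w = 0.0
    for _ in range(steps):
        grads = [grad(w, t) for t in targets]
        w -= lr * sum(grads) / len(grads)
    return w


def async_sgd(targets, lr=0.1, steps=200, staleness=2):
    # Asynchronous: workers take turns applying their own gradient without
    # waiting for the others. Each gradient is computed from a copy of w that
    # is `staleness` updates old (a hypothetical, fixed delay).
    w = 0.0
    history = [w]
    n = len(targets)
    for step in range(steps * n):
        worker = step % n
        stale_w = history[max(0, len(history) - 1 - staleness)]
        w -= (lr / n) * grad(stale_w, targets[worker])
        history.append(w)
    return w


if __name__ == "__main__":
    shards = [1.0, 2.0, 3.0, 4.0]  # made-up per-worker optima
    print("sync :", sync_sgd(shards))
    print("async:", async_sgd(shards))
```

In this toy case both variants end up close to the mean of the per-worker optima, but the asynchronous one keeps wobbling around it because each update uses stale information. That is the trade: no global wait at each step, in exchange for noisier updates.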
>
> Just one last thing regarding using GPUs in the cloud. Last time I checked
> even the spot instances were so expensive you'd be much better off buying
> them, even if only for a month. Obviously if you have a place to host them.
> And obviously in your DC you can use a decent network for faster training.
> As for ML services provided by AWS and others, my experience is rather
> limited. I helped one of our students with an ML service on AWS. Initially
> he was excited that he could just throw his data set at it and get
> something out. Alas, he quickly found out that he needed to do quite a bit
> more, so back to our HPC. Perhaps AutoML will be significantly improved in
> the coming years but for now just expecting to get something good without
> an effort is probably premature.
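
The buy-versus-rent comparison for GPUs made earlier in this message can be written down as a simple break-even calculation. This is only a sketch: the thread gives no prices, so the function name, its parameters and the numbers in the usage example are hypothetical placeholders, not real quotes from any vendor or cloud provider.

```python
def break_even_hours(purchase_price, cloud_price_per_hour, on_prem_cost_per_hour=0.0):
    """Hours of use after which buying a GPU is cheaper than renting one.

    All inputs are assumptions supplied by the caller:
    purchase_price        -- up-front cost of the card
    cloud_price_per_hour  -- rental price per GPU-hour (e.g. a spot price)
    on_prem_cost_per_hour -- power, cooling and hosting per hour of use
    """
    saving_per_hour = cloud_price_per_hour - on_prem_cost_per_hour
    if saving_per_hour <= 0:
        # Renting is never more expensive per hour, so buying never pays off.
        return float("inf")
    return purchase_price / saving_per_hour


if __name__ == "__main__":
    # Placeholder numbers purely for illustration.
    hours = break_even_hours(purchase_price=1000.0,
                             cloud_price_per_hour=2.0,
                             on_prem_cost_per_hour=0.5)
    print(f"break-even after {hours:.0f} hours (~{hours / 24:.0f} days of continuous use)")
```

The calculation ignores staffing, depreciation and idle time, which is where the "if you have a place to host them" caveat comes in.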
>
>
> On Sun, 9 Dec 2018 at 15:26, Gerald Henriksen <ghenriks at gmail.com> wrote:
>
>> On Fri, 7 Dec 2018 16:19:30 +0100, you wrote:
>>
>> >Perhaps for another thread:
>> >Actually I went to the AWS User Group in the UK on Wednesday. Very
>> >impressive, and there are the new Lustre filesystems and MPI networking.
>> >I guess the HPC World will see the same philosophy of building your setup
>> >using the AWS toolkit as Uber etc. etc. do today.
>> >Also a lot of noise is being made at the moment about the convergence of
>> >HPC and Machine Learning workloads.
>> >Are we going to see the Machine Learning folks adapting their workflows to
>> >run on HPC on-premise bare metal clusters?
>> >Or are we going to see them go off and use AWS (Azure, Google)?
>>
>> I suspect that ML will not go for on-premise for a number of reasons.
>>
>> First, ignoring cost, companies like Google, Amazon and Microsoft are
>> very good at ML because not only are they driving the research but
>> they need it for their business.  So they have the in house expertise
>> not only to implement cloud systems that are ideal for ML, but to
>> implement custom hardware - see Google's Tensor Processing Unit.
>>
>> Second, setting up a new cluster isn't going to be easy.  Finding
>> physical space, making sure enough utilities can be supplied to
>> support the hardware, staffing up, etc. are not only going to be
>> difficult but inherently take time, when instead you can simply sign
>> up to a cloud provider and have the project running within 24 hours.
>> Would HPC exist today as we know it if the ability to instantly turn
>> on a cluster existed at the beginning?
>>
>> Third, albeit this is very speculative, I suspect ML is
>> heading towards using custom hardware.  It has had a very good run
>> using GPUs, and a GPU will likely always be the entry point for
>> desktop ML, but unless Nvidia is holding back due to a lack of
>> competition it does appear the GPU is reaching an end to its
>> development much like CPUs have.  The latest hardware from Nvidia is
>> getting lacklustre reviews, and the bolting on of additional things
>> like raytracing is perhaps an indication that there are limits to how
>> much further the GPU architecture can be pushed.  The question then is
>> whether the ML market is big enough to have that custom hardware as an
>> OEM product like a GPU, or whether it will remain restricted to places
>> like Google who can afford to build it without the necessary overheads
>> of a consumer product.
>> _______________________________________________
>> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>