[Beowulf] HPC workflows
i.n.kozin at googlemail.com
Sun Dec 9 12:29:19 PST 2018
While I agree with many points made so far, I want to add that one aspect
which used to separate a typical HPC setup from some IT infrastructure is
complexity. I don't mean technological complexity (technologically HPC
can be fairly complex) but the diversity of, and the interrelationships
between, various things. Typically HPC is relatively homogeneous and
straightforward. But everything is changing, including HPC, so
modularisation is a natural approach to make systems more manageable:
containers, conda, Kubernetes etc. are solutions to fight complexity. Yes,
these solutions can be fairly complex too, but their impact is generally
intentionally restricted. For example, a conda environment can be rather
bloated, but trading size for flexibility is reasonable.
One of the points Werner Vogels, Amazon CTO, kept coming back to over and
over again in his keynote at the recent re:Invent is modular (cellular)
architecture at different levels (Lambdas, Firecracker, containers, VMs and
up), because working with redundant, replaceable modules makes services
scalable and resilient.
And I'm pretty sure the industry will continue on its path to embrace
microVMs as it did containers before that.
This modular approach may work quite well for on-prem IT, cloud or HTC
(High Throughput Computing) but may still be a challenge for HPC because
you can argue that a true HPC system must be tightly coupled (e.g. remember
As for ML and more specifically deep learning, it depends on what you do.
If you are doing inference, i.e. a production setup, which is more like
HTC, then everything works fine. But if you want to train a model on
ImageNet or larger and do it very quickly (hours) then you will benefit
from a tightly coupled setup (although there are tricks such as
asynchronous parameter updates to alleviate latency).
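
To illustrate the synchronous vs asynchronous distinction, here is a toy
sketch in plain Python. It is not how any particular framework does it: the
function names, the one-dimensional objective and the fixed "staleness"
model are all my own assumptions, chosen only to show that synchronous
training waits for every worker at each step, while asynchronous updates
apply gradients computed on slightly old parameters.

```python
# Toy comparison of synchronous vs asynchronous SGD on f(w) = 0.5 * (w - 3)^2.
# All names and the fixed-staleness model are illustrative assumptions.

TARGET = 3.0


def grad(w):
    """Gradient of 0.5 * (w - TARGET) ** 2 with respect to w."""
    return w - TARGET


def sync_sgd(workers, steps, lr):
    """Each step waits for all workers, then applies the averaged gradient."""
    w = 0.0
    for _ in range(steps):
        grads = [grad(w) for _ in range(workers)]  # every worker sees the same w
        w -= lr * sum(grads) / workers             # one global synchronisation per step
    return w


def async_sgd(updates, lr, staleness):
    """Each update uses a gradient computed on parameters `staleness` updates old."""
    history = [0.0]  # parameter value after each applied update
    for _ in range(updates):
        # A worker read the parameters some updates ago and reports back only now.
        stale_w = history[max(0, len(history) - 1 - staleness)]
        history.append(history[-1] - lr * grad(stale_w))
    return history[-1]


if __name__ == "__main__":
    print("sync :", sync_sgd(workers=4, steps=200, lr=0.1))
    print("async:", async_sgd(updates=200, lr=0.1, staleness=3))
```

With a small learning rate both variants settle near the target here, but
the asynchronous one tolerates slow links only because it accepts stale
gradients; with a larger learning rate or more staleness it can become
unstable, which is one reason a tightly coupled setup still helps.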
Two cases in point here: Kubeflow, whose scaling seems somewhat deficient,
and the Horovod library, which made many people rather excited because it
allows using TensorFlow with MPI.
While Docker and Singularity can be used with MPI, you'd probably want to
trim as much as you can if you want to push the scaling limit. But I think
we've already discussed many times on this list the topic of "heroic" HPC
vs "democratic" HPC (top vs tail).
Just one last thing regarding using GPUs in the cloud. Last time I checked,
even the spot instances were so expensive that you'd be much better off
buying the GPUs, even if only for a month. Obviously that assumes you have
a place to host them. And obviously in your own DC you can use a decent
network for faster training.
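
The buy-vs-rent argument is just break-even arithmetic. A minimal sketch,
assuming a simple model with a one-off purchase cost and flat hourly rates;
the function name and the numbers below are placeholders in arbitrary cost
units, not real prices, so plug in your own quotes.

```python
def break_even_hours(purchase_cost, hosting_cost_per_hour, cloud_price_per_hour):
    """Hours of use after which owning the hardware becomes cheaper than renting.

    Returns None when renting costs no more per hour than hosting your own
    hardware, in which case owning never pays off under this simple model.
    """
    saving_per_hour = cloud_price_per_hour - hosting_cost_per_hour
    if saving_per_hour <= 0:
        return None
    return purchase_cost / saving_per_hour


if __name__ == "__main__":
    # Placeholder values in arbitrary cost units, NOT real prices.
    hours = break_even_hours(purchase_cost=1000.0,
                             hosting_cost_per_hour=0.5,
                             cloud_price_per_hour=2.5)
    print("break-even after", hours, "hours of use")
```

If the break-even point is well under a month of continuous use, buying
wins even for a short project, provided you really can host the hardware.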
As for ML services provided by AWS and others, my experience is rather
limited. I helped one of our students with an ML service on AWS. Initially
he was excited that he could just throw his data set at it and get
something out. Alas, he quickly found out that he needed to do quite a bit
more, so it was back to our HPC. Perhaps AutoML will be significantly
improved in the coming years, but for now expecting to get something good
without any effort is probably premature.
On Sun, 9 Dec 2018 at 15:26, Gerald Henriksen <ghenriks at gmail.com> wrote:
> On Fri, 7 Dec 2018 16:19:30 +0100, you wrote:
> >Perhaps for another thread:
> >Actually I went to the AWS User Group in the UK on Wednesday. Very
> >impressive, and there are the new Lustre filesystems and MPI networking.
> >I guess the HPC World will see the same philosophy of building your setup
> >using the AWS toolkit as Uber etc. etc. do today.
> >Also a lot of noise is being made at the moment about the convergence of
> >HPC and Machine Learning workloads.
> >Are we going to see the Machine Learning folks adapting their workflows to
> >run on HPC on-premise bare metal clusters?
> >Or are we going to see them go off and use AWS (Azure, Google ?)
> I suspect that ML will not go for on-premise for a number of reasons.
> First, ignoring cost, companies like Google, Amazon and Microsoft are
> very good at ML because not only are they driving the research but
> they need it for their business. So they have the in house expertise
> not only to implement cloud systems that are ideal for ML, but to
> implement custom hardware - see Google's Tensor Processor Unit.
> Second, setting up a new cluster isn't going to be easy. Finding
> physical space, making sure enough utilities can be supplied to
> support the hardware, staffing up, etc. are not only going to be
> difficult but inherently takes time when instead you can simply sign
> up to a cloud provider and have the project running within 24 hours.
> Would HPC exist today as we know it if the ability to instantly turn
> on a cluster existed at the beginning?
> Third, albeit this is very speculative, I suspect ML is
> heading towards using custom hardware. It has had a very good run
> using GPUs, and a GPU will likely always be the entry point for
> desktop ML, but unless Nvidia is holding back due to a lack of
> competition it does appear the GPU is reaching an end to its
> development much like CPUs have. The latest hardware from Nvidia is
> getting lacklustre reviews, and the bolting on of additional things
> like raytracing is perhaps an indication that there are limits to how
> much further the GPU architecture can be pushed. The question then is
> whether the ML market is big enough to have that custom hardware as
> an OEM product like a GPU, or will it remain restricted to places
> like Google who can afford to build it without the necessary
> overheads of a consumer