<div dir="ltr"><div>While I agree with many of the points made so far, I want to add that one aspect which used to separate a typical HPC setup from general IT infrastructure is complexity. I don't mean technological complexity (technologically, HPC can be fairly complex) but diversity and the interrelationships between components. Typically HPC is relatively homogeneous and straightforward. But everything is changing, including HPC, and modularisation is a natural way to keep systems manageable: containers, conda, Kubernetes, etc. are all tools for fighting complexity. Yes, these tools can be fairly complex too, but their impact is generally restricted by design. For example, a conda environment can be rather bloated, but giving up compactness for flexibility is a reasonable trade-off.</div><div dir="ltr">One of the points Werner Vogels, Amazon's CTO, kept coming back to in his keynote at the recent re:Invent is modular (cellular) architecture at different levels (Lambdas, Firecracker, containers, VMs and up), because working with redundant, replaceable modules makes services scalable and resilient.</div><div>And I'm pretty sure the industry will continue on its path to embrace microVMs, as it did containers before that.</div><div>This modular approach may work quite well for on-prem IT, cloud or HTC (High Throughput Computing) but may still be a challenge for HPC, because you can argue that a true HPC system must be tightly coupled (e.g. remember OS jitter?).</div><div>As for ML, and more specifically deep learning, it depends on what you do. If you are doing inference, i.e. a production setup that looks more like HTC, then everything works fine. 
But if you want to train a model on ImageNet or something larger, and do it very quickly (in hours), then you will benefit from a tightly coupled setup (although there are tricks such as asynchronous parameter updates to alleviate latency).</div><div>Two cases in point here: Kubeflow, whose scaling seems somewhat deficient, and the Horovod library, which got many people rather excited because it allows using TensorFlow with MPI.</div><div>While Docker and Singularity can be used with MPI, you'd probably want to trim as much as you can if you want to push the scaling limit. But I think we've already discussed the topic of "heroic" HPC vs "democratic" HPC (top vs tail) many times on this list.</div><div><br></div><div>Just one last thing regarding using GPUs in the cloud. Last time I checked, even the spot instances were so expensive that you'd be much better off buying the GPUs, even if you only need them for a month. Obviously that assumes you have a place to host them. And obviously in your own DC you can use a decent network for faster training.</div><div>As for ML services provided by AWS and others, my experience is rather limited. I helped one of our students with an ML service on AWS. Initially he was excited that he could just throw his data set at it and get something out. Alas, he quickly found out that he needed to do quite a bit more, so it was back to our HPC. Perhaps AutoML will improve significantly in the coming years, but for now expecting to get something good without effort is probably premature.</div></div><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr">On Sun, 9 Dec 2018 at 15:26, Gerald Henriksen <<a href="mailto:ghenriks@gmail.com" target="_blank">ghenriks@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Fri, 7 Dec 2018 16:19:30 +0100, you wrote:<br>

<br>

>Perhaps for another thread:<br>

>Actually I went to the AWS User Group in the UK on Wednesday. Very<br>

>impressive, and there are the new Lustre filesystems and MPI networking.<br>

>I guess the HPC World will see the same philosophy of building your setup<br>

>using the AWS toolkit as Uber etc. etc. do today.<br>

>Also a lot of noise is being made at the moment about the convergence of<br>

>HPC and Machine Learning workloads.<br>

>Are we going to see the Machine Learning folks adapting their workflows to<br>

>run on HPC on-premise bare metal clusters?<br>

>Or are we going to see them go off and use AWS (Azure, Google ?)<br>

<br>

I suspect that ML will not go for on-premise for a number of reasons.<br>

<br>

First, ignoring cost, companies like Google, Amazon and Microsoft are<br>

very good at ML because not only are they driving the research but<br>

they need it for their business.  So they have the in house expertise<br>

not only to implement cloud systems that are ideal for ML, but to<br>

implement custom hardware - see Google's Tensor Processing Unit.<br>

<br>

Second, setting up a new cluster isn't going to be easy.  Finding<br>

physical space, making sure enough utilities can be supplied to<br>

support the hardware, staffing up, etc.  are not only going to be<br>

difficult but inherently take time when instead you can simply sign<br>

up to a cloud provider and have the project running within 24 hours.<br>

Would HPC exist today as we know it if the ability to instantly turn<br>

on a cluster existed at the beginning?<br>

<br>

Third, although this is very speculative, I suspect ML is<br>

heading towards using custom hardware.  It has had a very good run<br>

using GPUs, and a GPU will likely always be the entry point for<br>

desktop ML, but unless Nvidia is holding back due to a lack of<br>

competition it does appear the GPU is reaching an end to its<br>

development much like CPUs have.  The latest hardware from Nvidia is<br>

getting lacklustre reviews, and the bolting on of additional things<br>

like raytracing is perhaps an indication that there are limits to how<br>

much further the GPU architecture can be pushed.  The question then is<br>

whether the ML market is big enough to have that custom hardware as an OEM product<br>

like a GPU, or whether it will remain restricted to places like Google who can<br>

afford to build it without the necessary overheads of a consumer<br>

product.<br>

_______________________________________________<br>

Beowulf mailing list, <a href="mailto:Beowulf@beowulf.org" target="_blank">Beowulf@beowulf.org</a> sponsored by Penguin Computing<br>

To change your subscription (digest mode or unsubscribe) visit <a href="http://www.beowulf.org/mailman/listinfo/beowulf" rel="noreferrer" target="_blank">http://www.beowulf.org/mailman/listinfo/beowulf</a><br>

</blockquote></div>