[Beowulf] [External] Spark, Julia, OpenMPI etc. - all in one place

Benson Muite benson_muite at emailplus.org
Tue Oct 13 05:25:19 PDT 2020


On 10/13/20 3:12 PM, Oddo Da wrote:
> Jim, Peter: by "things have not changed in the tooling" I meant that it is 
> the same approach/paradigm as it was when I was in HPC back in the late 
> 1990s/early 2000s. Even if you look at books about OpenMPI, you can go 
> on their mailing list and ask what books to read and you will be pointed 
> to the same stuff published 20+ years ago and maybe there are one or two 
> books that are "fresher" than that (I did that a few months ago, naively 
> thinking that things have changed ;-) ).
> 
> The approach is still the same - you have to write the code at the low 
> level and worry about everything. It would be nice if this was improved 
> and things were abstracted up and away a bit. The appearance of Spark, 
> for example, did exactly that for data science/machine learning/"big 
> data" - esp. when you write it in Scala (functional programming) - it 
> just makes for all sorts of cleaner, abstracted, more correct code where 
> the framework worries about the underlying data/computation locality, 
> the communication between all the machinery etc. etc. and you are left 
> to worry about the problem you are solving. I just feel that in the HPC 
> world we have not moved to this point yet and am trying to understand why.
> 
> I mean, let's say I was a data science researcher at a university and 
> all that was on offer was the traditional HPC cluster - what tooling 
> would I use to do my research? The whole world is doing something else 
> but I am stuck worrying about the low level details.... or I need to ask 
> for a separate HDFS/Spark cluster? What if I want to stream data from 
> somewhere like it is done commonly in the industry (solutions like Kafka 
> etc.) - my only option is to stand up a local cluster (costs time, 
> money, ongoing admin/maintenance) or to go to AWS or Azure and spend 
> taxpayer money to fill corporate coffers for what should already be a 
> solved problem with the money that was spent for all the hardware at the 
> University already?
> 
> BTW, Spark is just an example of how tooling/methodologies have improved 
> in the industry in the domain of distributed computation. This is why I 
> thought that Julia may be one of those things that provides a different 
> (improved?) way of doing things where both the climate modeling guys and 
> the data science guys can utilize the same HPC hardware....
> 
A number of countries have national infrastructures. Small and moderate 
allocations on XSEDE or similar allow people some experience with HPC 
without their institution investing in significant computational 
resources. The problem is usually that knowledge of how to use these 
resources may then be scarce at an institution with no HPC resources of 
its own.

A typical university cluster can run data science workloads with Spark, 
Hadoop etc.; it just requires the admins to make this possible. Systems 
like Comet are designed for this kind of work:
https://portal.xsede.org/sdsc-comet
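
For what it is worth, the programming model on the Spark side really is 
compact. A minimal PySpark sketch of a word count (assuming a Spark 
installation is available on the cluster; the input path is a 
placeholder) leaves partitioning, shuffles and data locality to the 
framework, and a Scala version would look much the same:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# "data.txt" is a placeholder path; the framework decides how the
# data is partitioned and where the computation runs
counts = (spark.read.text("data.txt").rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

for word, n in counts.take(20):
    print(word, n)

spark.stop()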

Once jobs start using tens to hundreds of thousands of core hours, 
taxpayer money (and probably the environment too) is saved by writing in 
a low-level language.
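
For contrast, the "low level" path described earlier in the thread means 
managing the decomposition and communication yourself. A minimal sketch 
of a parallel sum with mpi4py (assuming mpi4py and an MPI library are 
installed; a production code would typically be in C or Fortran), run 
with something like mpirun -n 4:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# the programmer chooses the decomposition: each rank sums a strided
# slice of the index range, then the partial results are combined
local = sum(i * i for i in range(rank, 1_000_000, size))
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print("sum of squares:", total)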

A small number of countries design entirely new systems and train their 
students to write and port software for them - much as happened with 
bleeding-edge systems 20 years ago :)

