[Beowulf] Re: Spark, Julia, OpenMPI etc. - all in one place

Douglas Eadline deadline at eadline.org
Tue Oct 13 06:55:00 PDT 2020


I have noticed a lot of Hadoop/Spark references in the replies.
The word "Hadoop" is probably the most misunderstood
word in computing today, and many people have only a
vague idea of what it actually is.

Hadoop V1 was a monolithic Map Reduce framework written in
Java. (BTW, Map Reduce is a SIMD-style, data-parallel algorithm.)
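
To make the pattern concrete, here is a rough word-count sketch in
plain Scala (no Hadoop needed, the input data is made up): the same
map function runs on every input element, the results are grouped by
key, then reduced, which is why the pattern spreads across nodes so
easily.

  object WordCount {
    def main(args: Array[String]): Unit = {
      val lines = Seq("the quick brown fox", "the lazy dog", "the end")

      val counts = lines
        .flatMap(_.split("\\s+"))      // map: split lines into words
        .map(word => (word, 1))        // map: emit (word, 1) pairs
        .groupBy(_._1)                 // shuffle: group pairs by word
        .map { case (word, pairs) =>   // reduce: sum the counts per word
          (word, pairs.map(_._2).sum)
        }

      counts.foreach(println)          // e.g. (the,3), (fox,1), ...
    }
  }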

In Hadoop V2, the Map Reduce component was separated from the
scheduler (YARN) and the underlying distributed file system (HDFS).
It is best thought of as a "platform" for developing big
data systems. The most popular Map Reduce application is Hive.
Developed at Facebook, it allows relational (SQL) queries to be run
at scale.
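
As a rough sketch of what "SQL at scale" looks like in practice, the
same kind of query could be run from Spark against a Hive table (the
table name below is made up, and this assumes a Hive metastore is
already configured):

  import org.apache.spark.sql.SparkSession

  object HiveQuery {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("hive-query")
        .enableHiveSupport()   // needs hive-site.xml / a metastore on the classpath
        .getOrCreate()

      // "web_logs" is a hypothetical Hive table; the query is plain SQL
      val hits = spark.sql(
        "SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status")
      hits.show()

      spark.stop()
    }
  }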

Hadoop V3 and beyond is moving more toward a true cloud-based
environment with a new file system called Ozone. Note that the need
for HDFS made cloud migration difficult.

Spark is a completely separate code base that has its own Map Reduce
engine. It can work stand-alone, with the YARN scheduler, or with
other schedulers. It can also take advantage of HDFS.
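
Here is a minimal stand-alone Spark job in Scala, just to show where
the scheduler and file system choices appear (the HDFS path is made
up; "local[*]" would normally be replaced by "yarn" or a spark:// URL
when submitting to a cluster):

  import org.apache.spark.sql.SparkSession

  object SparkWordCount {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("spark-wordcount")
        .master("local[*]")    // or "yarn" / spark://... when run on a cluster
        .getOrCreate()

      // hypothetical HDFS path; file:// or s3a:// paths work the same way
      val lines = spark.sparkContext.textFile("hdfs:///data/input.txt")

      val counts = lines
        .flatMap(_.split("\\s+"))
        .map(word => (word, 1))
        .reduceByKey(_ + _)    // Spark's own Map Reduce engine handles the shuffle

      counts.take(10).foreach(println)
      spark.stop()
    }
  }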

Spark is a framework, Hadoop is a platform. Map Reduce is a
SIMD-style, data-parallel algorithm that works well with large
amounts of read-only data.

There is more to it, but that is the gist of it.

--
Doug

> Hello,
>
> I used to be in HPC back when we built beowulf clusters by hand ;) and
> wrote code in C/pthreads, PVM and MPI, and back when anyone could walk into
> fields like bioinformatics - all that was needed was a pulse, some C and
> Perl, and a desire to do ;-). Then I left for the private sector and
> stumbled into "big data" some years later - I wrote a lot of code in Spark
> and Scala, worked in infrastructure to support it, etc.
>
> Then I went back (in 2017) to HPC. I was surprised to find that not much
> has changed - researchers and grad students still write code in MPI and
> C/C++ and maybe some Python or R for visualization or localized data
> analytics. I also noticed that it was not easy to "marry" things like big
> data with HPC clusters - tools like Spark/Hadoop do not really have the
> same underlying infrastructure assumptions as do things like
> MPI/supercomputers. However, I find it wasteful for a university to run
> separate clusters to support a data science/big data load vs traditional
> HPC.
>
> I then stumbled upon languages like Julia - I like its approach, code is
> data, visualization is easy, decent ML/DS tooling.
>
> How does it fare on a traditional HPC cluster? Are people using it to
> replace their MPI workloads? On the opposite side, has it caught up to
> Spark in terms of DS/ML quality of offering? In other words, can it be
> used, in one fell swoop, as a unifying substitute for both opposing
> approaches?
>
> I realize that many people have already committed to certain
> tech/paradigms, but this is mostly educational debt (if MPI, or Spark on
> the other side, is working for me, why go to something different?) - but
> is there anything substantial stopping new people with no such debt from
> starting out with a different approach (offerings like Julia)?
>
> I do not have too much experience with Julia (and hence may be barking up
> the wrong tree) - in that case I am wondering what people are doing to
> "marry" the loads of traditional HPC with "big data" as practiced by
> commercial/industry entities on a single underlying hardware offering. I
> know there are things like Twister2, but it is unclear to me (from a
> cursory examination) what it actually offers in the context of my
> questions above.
>
> Any input, corrections, schooling me etc. are appreciated.
>
> Thank you!
> _______________________________________________
> Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> https://beowulf.org/cgi-bin/mailman/listinfo/beowulf
>




