[Beowulf] Large amounts of data to store and process

Jonathan Aquilina jaquilina at eagleeyet.net
Mon Mar 11 06:20:33 PDT 2019


Lots of food for thought to munch on here, for sure.

I guess for now it's hard to gauge performance gains with Julia and JuliaDB, given that we are still prototyping the entire system; eventually we will get into coding everything and stitching it all together.
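For what it is worth, the sort of thing we are prototyping looks roughly
like the sketch below (the file name and column names are made up for
illustration; this is not our actual code):

    using JuliaDB

    # load a flat CSV file into a table (hypothetical file and schema)
    t = loadtable("measurements.csv")

    # the kind of simple row-level filtering we are experimenting with
    hot = filter(r -> r.temperature > 30.0, t)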

I for sure will give you guys an update once we get to that stage.

Regards,
Jonathan

On 11/03/2019, 14:18, "Douglas Eadline" <deadline at eadline.org> wrote:

    > Hi All,
    > Basically I have sat down with my colleague and we have opted to go
    > down the route of Julia with JuliaDB for this project. But here is an
    > interesting thought that I have been pondering: if Julia is an
    > up-and-coming fast language for working with large amounts of data,
    > how will that affect HPC, the way it is currently used, and the way
    > HPC systems are created?
    
    
    First, IMO good choice.
    
    Second, a short list of actual conversations.
    
    1) "This code is written in Fortran." I have been met with
    puzzling looks when I say the the word "Fortran." Then it
    comes, "... ancient language, why not port to modern ..."
    If you are asking that question young Padawan you have
    much to learn, maybe try web pages"
    
    2) "I'll just use Python because it works on my laptop."
    Later, "It will just run faster on a cluster, right?"
    and "My little Python program is now kind-of big and has
    become slow, should I use TensorFlow?"
    
    3) <mccoy>
    "Dammit Jim, I don't want to learn/write Fortran, C, C++, and MPI.
    I'm a (fill in domain-specific scientific/technical position)."
    </mccoy>
    
    My reply,"I agree and wish there was a better answer to that question.
    The computing industry has made great strides in HW with
    multi-core, clusters etc. Software tools have always lagged
    hardware. In the case of HPC it is a slow process and
    in HPC the whole programming "thing" is not as "easy" as
    it is in other sectors, warp drives and transporters
    take a little extra effort.
    
    4) Then I suggest Julia: "I invite you to try Julia. It is
    easy to get started, fast, and can grow with your application."
    Then I might say, "In a way it is HPC BASIC; if you are old
    enough you will understand what I mean by that."
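    As a generic illustration of the "easy to get started" claim (my
    sketch, not tied to any particular project): the obvious loop below
    compiles to fast native code with no special tooling.

        # sum of squares, written the naive way
        function sumsq(v)
            s = 0.0
            for x in v
                s += x * x
            end
            return s
        end

        v = rand(10_000_000)
        sumsq(v)          # first call includes JIT compile time
        @time sumsq(v)    # steady-state timing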
    
    The question with languages like Julia (or Chapel, etc) is:
    
      "How much performance are you willing to give up for convenience?"
    
    The goal is to keep the programmer close to the problem at hand
    and away from the nuances of the underlying hardware. Obviously
    the more performance needed, the closer you need to get to the hardware.
    This decision goes beyond software tools; there are all kinds
    of cost/benefit trade-offs that need to be considered. And then there
    is I/O ...
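    To make that trade-off concrete with a small generic sketch (again
    my illustration, assuming a Float64 vector): in Julia you can start
    with the convenient one-liner and move closer to the hardware only
    where a profile says it matters.

        # convenient version: one line, close to the problem
        sumsq(v) = sum(x -> x * x, v)

        # tuned version: skip bounds checks and hint the compiler to
        # vectorize, trading a little safety for speed
        function sumsq_fast(v::Vector{Float64})
            s = 0.0
            @inbounds @simd for i in eachindex(v)
                s += v[i] * v[i]
            end
            return s
        end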
    
    --
    Doug
    
    > Regards,
    > Jonathan
    > -----Original Message-----
    > From: Beowulf <beowulf-bounces at beowulf.org> On Behalf Of
    > Michael Di Domenico
    > Sent: 04 March 2019 17:39
    > Cc: Beowulf Mailing List <beowulf at beowulf.org>
    > Subject: Re: [Beowulf] Large amounts of data to store and process
    >
    > On Mon, Mar 4, 2019 at 8:18 AM Jonathan Aquilina
    > <jaquilina at eagleeyet.net> wrote:
    >> As previously mentioned, we don't really need to have anything
    >> indexed, so I am thinking flat files are the way to go; my only
    >> concern is the performance of large flat files.
    > potentially, there are many factors in the work flow that ultimately
    > influence the decision, as others have pointed out.  my flat file
    > example is only one, where we just repeatedly blow through the files.
    >> Isn't that what HDFS is for, to deal with large flat files?
    > large is relative.  a 256GB file isn't "large" anymore.  i've pushed
    > TB files through hadoop and run the terabyte sort benchmark, and yes
    > it can be done in minutes (time-scale), but you need an astounding
    > amount of hardware to do it (in the last benchmark paper i saw, it
    > was something like 1000 nodes).  you can accomplish the same feat
    > using less, and less complicated, hardware/software, and if your devs
    > are willing to adapt to the hadoop ecosystem, you're sunk right off
    > the dock.
    > to get a more targeted answer from the numerous smart people on the
    > list, you'd need to open up the app and workflow to us.  there are
    > just too many variables.
    > _______________________________________________
    > Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
    > To change your subscription (digest mode or unsubscribe) visit
    > http://www.beowulf.org/mailman/listinfo/beowulf
    
    
    -- 
    Doug
    
