[Beowulf] Large amounts of data to store and process

Jonathan Aquilina jaquilina at eagleeyet.net
Mon Mar 4 05:18:46 PST 2019


Hi Michael,

As previously mentioned, we don't really need anything indexed, so I am thinking flat files are the way to go. My only concern is the performance of very large flat files. Isn't that what HDFS is for, dealing with large flat files?
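
For a concrete sense of the access pattern in question, here is a minimal sketch of a plain end-to-end scan, assuming a hypothetical CSV layout with a numeric value column; a sequential read like this is bounded by storage bandwidth, not by the file being "flat":

    import csv

    def scan(path):
        """Stream a flat file record by record; memory use stays
        constant regardless of file size."""
        total, count = 0.0, 0
        with open(path, newline="") as f:
            for row in csv.reader(f):      # one record at a time
                total += float(row[1])     # hypothetical value column
                count += 1
        return total / count if count else 0.0

    print(scan("timeseries.csv"))          # hypothetical input file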

On 04/03/2019, 14:13, "Beowulf on behalf of Michael Di Domenico" <beowulf-bounces at beowulf.org on behalf of mdidomenico4 at gmail.com> wrote:

    Even though you've alluded to this being time-series data: is
    there a requirement to index into the data, or do you just read
    the data end-to-end and do some calculations?
    
    I routinely face these kinds of issues, but we're not indexing
    into the data, so having things in HDFS or an RDBMS doesn't give
    us any benefit.  We pull all the data into organized flat files
    and blow through them with HTCondor.  If the researchers want to
    tweak the code, they do, and then just rerun the whole simulation.
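
    (As a rough sketch, each HTCondor job in this pattern is just a
    small worker that takes one flat file and writes one result; the
    two-column format and the max calculation below are made up for
    illustration:)

        #!/usr/bin/env python3
        # Hypothetical per-job worker; HTCondor queues one of these
        # per input file (e.g. "queue infile matching files data/*.dat").
        import sys

        def process(infile, outfile):
            peak = float("-inf")
            with open(infile) as f:
                for line in f:              # plain end-to-end read
                    t, v = line.split()     # hypothetical "time value" rows
                    peak = max(peak, float(v))
            with open(outfile, "w") as out:
                out.write(f"{infile} max={peak}\n")

        if __name__ == "__main__":
            process(sys.argv[1], sys.argv[2])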
    
    Sometimes that's minutes, sometimes days.  But in either case the
    time to develop code is always much shorter, because the data is
    in flat files and easier for my "non-programmer" programmers to
    work with.  No need to learn HDFS/Hadoop or SQL.
    
    If you need to index the data and jump around, HDFS is probably
    still not the best solution unless you want to index the files,
    and 250 GB isn't really big enough to warrant an HDFS cluster.
    I've generally found that unless you're dealing with multi-TB+
    datasets, you can't scale the hardware out enough to get the
    speedup.  (Yes, I know there are tweaks to change this, but I've
    found it's just simpler to buy a bigger Lustre system.)
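
    (And if you do eventually need to jump around, a plain sidecar
    byte-offset index over the flat file gets you random access
    without any cluster at all; a minimal sketch, assuming one record
    per line with a leading key field:)

        # Build a byte-offset index once, then seek straight to records.
        def build_index(path):
            index, offset = {}, 0
            with open(path, "rb") as f:
                for line in f:
                    key = line.split(b",", 1)[0]   # hypothetical leading key
                    index[key] = offset
                    offset += len(line)
            return index

        def lookup(path, index, key):
            with open(path, "rb") as f:
                f.seek(index[key])                  # jump to the record
                return f.readline()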
    
    
    
    On Mon, Mar 4, 2019 at 1:39 AM Jonathan Aquilina
    <jaquilina at eagleeyet.net> wrote:
    >
    > Good Morning all,
    >
    >
    >
    > I am working on a project that I sadly can't go into much detail about, but it will ingest quite large amounts of data, and output would need to be efficiently returned to the end user within around 10 minutes. I am in discussions with another partner involved in this project about the best way forward.
    >
    >
    >
    > For me, given the amount of data (and it is a huge amount of data), an RDBMS such as PostgreSQL would be a major bottleneck. Another option that was considered is flat files, and I think the best fit for that would be a Hadoop cluster with HDFS. But in the case of HPC, how can such an environment help with ingesting and analyzing large amounts of data? Would said flat files be put on a SAN/NAS and accessed through an NFS share for computational purposes?
    >
    >
    >
    > Regards,
    >
    > Jonathan
    >
    _______________________________________________
    Beowulf mailing list, Beowulf at beowulf.org sponsored by Penguin Computing
    To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf
    