[Beowulf] Large amounts of data to store and process

Joe Landman joe.landman at gmail.com
Mon Mar 4 06:27:52 PST 2019


On 3/4/19 1:55 AM, Jonathan Aquilina wrote:
> Hi Tony,
>
> Sadly I can't go into much detail due to being under an NDA. At this point with the prototype we have around 250GB of sample data, but again this data is dependent on the type of aircraft. Larger aircraft and longer flights will generate a lot more data, as they have more sensors and will log more data than the sample data that I have. The sample data is 250GB for 35 aircraft of the same type.


You need to return your answers in ~10m or 600s, with an assumed data 
set size of 250GB or more (assuming you meant GB and not Gb).  How you 
meet that depends upon the nature of the calculation: whether or not 
you can perform the calculations on subsets, or whether it requires 
multiple passes through the data.

I've noticed some recommendations popping up ahead of understanding 
what the rate-limiting factors are for returning results from 
calculations based upon this data set.  I'd suggest focusing on the 
analysis needs to start, as this will provide some level of guidance on 
the system(s) design required to meet your objectives.

First off, do you know whether your code will meet this 600s response 
time with this 250GB data set?  I am assuming this is unknown at this 
moment, but if you have response time data for smaller data sets, you 
could construct a rough scaling study and build a simple predictive 
model (a quick sketch of this follows these questions).

Second, do you need the entire bolus of data, all 250GB, in order to 
generate a response to within required accuracy?  If not, great, and 
what size do you need?

Third, will this data set grow over time (looking at your writeup, it 
looks like this is a definite "yes")?

Fourth, does the code require physical access to all of the data bolus 
(what is needed for the calculation) locally in order to correctly operate?

Fifth, will the data access patterns for the code be streaming, 
searching, or random?  In only one of these cases would a database (SQL 
or noSQL) be a viable option.

Sixth, is your working data set size comparable to the bolus size (i.e. 
250GB)?

Seventh, can your code work correctly with sharded data (variation on 
second point)?
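
On the first question above: a rough scaling study can be as simple as 
timing your code on a few subset sizes, fitting a power law, and 
extrapolating to 250GB.  A minimal Python sketch, where the subset 
sizes and timings are made-up placeholders (substitute your own 
measurements):

    # Fit t = a * size^b to runtimes measured on smaller subsets,
    # then extrapolate to the full 250GB data set.
    import numpy as np

    sizes_gb = np.array([1.0, 5.0, 25.0, 100.0])   # subsets tested
    times_s  = np.array([3.0, 14.0, 70.0, 280.0])  # runtimes (made up)

    # linear fit in log-log space: log t = log a + b * log size
    b, log_a = np.polyfit(np.log(sizes_gb), np.log(times_s), 1)

    predicted = np.exp(log_a) * 250.0 ** b
    print("scaling exponent b ~ %.2f" % b)
    print("predicted runtime at 250GB: %.0f s (budget 600 s)" % predicted)

If the exponent comes out near 1, extrapolation is reasonably safe; 
much above 1, and working on subsets or shards (questions two and 
seven) becomes far more attractive.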


Now some brief "data physics".

a) (data on durable storage) 250GB @ 1GB/s -> 250s to read, once, 
assuming large block sequential read.  For a 600s response time, that 
leaves you with 350s to calculate.  Is this enough time?  Is a single 
pass (streaming) workable?

b) (data in RAM) 250GB @ 100GB/s -> 2.5s to walk through once in 
parallel amongst multiple cores.  If multiple/many passes through the 
data are required, this strongly suggests a large memory machine (512GB 
or larger).

c) if your data is shardable, and you can distribute it amongst N 
machines, the above analyses still hold, replacing the 250GB with the 
size of the shards.  If you can do this, how much information does your 
code need to share amongst the worker nodes in order to effect the 
calculation?  This will provide guidance on interconnect choices.
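
If it helps, the arithmetic above is trivial to keep in a small script 
so you can play with the assumptions.  A quick Python sketch using the 
1GB/s and 100GB/s figures from above; the shard count is just a 
made-up example:

    # Back-of-envelope "data physics": time to move the bolus once,
    # and what that leaves for computation in the 600s budget.
    DATA_GB   = 250.0
    BUDGET_S  = 600.0
    DISK_GBPS = 1.0    # large block sequential read, durable storage
    MEM_GBPS  = 100.0  # aggregate memory bandwidth across cores
    N_SHARDS  = 8      # hypothetical worker count, if shardable

    def show(label, size_gb, bw_gbps):
        io = size_gb / bw_gbps
        print("%-26s %6.1f s I/O, %6.1f s left to compute"
              % (label, io, BUDGET_S - io))

    show("one pass from disk", DATA_GB, DISK_GBPS)
    show("one pass from RAM", DATA_GB, MEM_GBPS)
    show("per-shard pass from disk", DATA_GB / N_SHARDS, DISK_GBPS)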


Basically, I am advocating focusing on the analysis needs, how they 
scale/grow, and your near/medium/long term goals with this, before you 
commit to a specific design/implementation.  Avoid the "if all you have 
is a hammer, every problem looks like a nail" view as much as possible.


-- 
Joe Landman
e: joe.landman at gmail.com
t: @hpcjoe
w: https://scalability.org
g: https://github.com/joelandman
l: https://www.linkedin.com/in/joelandman


