[Beowulf] Large amounts of data to store and process
joe.landman at gmail.com
Mon Mar 4 06:27:52 PST 2019
On 3/4/19 1:55 AM, Jonathan Aquilina wrote:
> Hi Tony,
> Sadly I cant go into much detail due to me being under an NDA. At this point with the prototype we have around 250gb of sample data but again this data is dependent on the type of air craft. Larger aircraft and longer flights will generate a lot more data as they have more sensors and will log more data than the sample data that I have. The sample data is 250gb for 35 aircraft of the same type.
You need to return your answers in ~10m or 600s, with an assumed data
set size of 250GB or more (assuming you meant GB and not Gb). Whether
this is feasible depends upon the nature of the calculation: whether or
not you can perform the calculations on subsets, or whether it requires
multiple passes through the data.
I've noticed some recommendations popping up ahead of understanding what
the rate limiting factors are for returning the results from calculations
based upon this data set. I'd suggest focusing on the analysis needs to
start, as this will provide some level of guidance on the system(s)
design required to meet your objectives.
First off, do you know whether your code will meet this 600s response
time with this 250GB data set? I am assuming this is unknown at this
moment, but if you have response time data for smaller data sets, you
could construct a rough scaling study and build a simple predictive model.
Second, do you need the entire bolus of data, all 250GB, in order to
generate a response to within required accuracy? If not, great, and
what size do you need?
Third, will this data set grow over time (looking at your writeup, it
looks like this is a definite "yes")?
Fourth, does the code require physical access to all of the data bolus
(what is needed for the calculation) locally in order to correctly operate?
Fifth, will the data access patterns for the code be streaming,
searching, or random? In only one of these cases would a database (SQL
or noSQL) be a viable option.
Sixth, is your working data set size comparable to the bolus size (e.g.
Seventh, can your code work correctly with sharded data (variation on
Now some brief "data physics".
a) (data on durable storage) 250GB @ 1GB/s -> 250s to read, once,
assuming large block sequential read. For a 600s response time, that
leaves you with 350s to calculate. Is this enough time? Is a single
pass (streaming) workable?
b) (data in ram) 250GB @ 100GB/s -> 2.5s to walk through once in
parallel amongst multiple cores. If multiple/many passes through data
are required, this strongly suggests a large memory machine (512GB or
c) if your data is shardable, and you can distribute it amongst N
machines, the above analyses still hold, replacing the 250GB with the
size of the shards. If you can do this, how much information does your
code need to share amongst the worker nodes in order to effect the
calculation? This will provide guidance on interconnect choices.
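The arithmetic in a) through c) can be sketched in a few lines of Python. This is only a back-of-envelope model: the function names are made up for illustration, the 1GB/s and 100GB/s figures are the ones quoted above, and the shard count of 10 is an arbitrary assumption. It ignores latency, contention, and any overlap of I/O with compute.

```python
def pass_time_s(size_gb, bandwidth_gb_per_s):
    """Seconds for one full sequential pass over the data."""
    return size_gb / bandwidth_gb_per_s


def compute_budget_s(size_gb, bandwidth_gb_per_s, deadline_s, passes=1):
    """Seconds left for calculation after `passes` full reads of the data."""
    return deadline_s - passes * pass_time_s(size_gb, bandwidth_gb_per_s)


def max_passes(size_gb, bandwidth_gb_per_s, deadline_s):
    """Upper bound on full passes that fit in the deadline (zero compute time)."""
    return int(deadline_s // pass_time_s(size_gb, bandwidth_gb_per_s))


def shard_size_gb(size_gb, n_machines):
    """Per-machine data size, assuming the data splits evenly."""
    return size_gb / n_machines


if __name__ == "__main__":
    DATA_GB = 250.0      # sample data set size from the thread
    DEADLINE_S = 600.0   # ~10 minute response time
    DISK_GB_S = 1.0      # case a): durable storage, large block sequential read
    RAM_GB_S = 100.0     # case b): aggregate memory bandwidth across cores
    N_MACHINES = 10      # case c): ASSUMED shard count, purely illustrative

    print("disk pass (s):      ", pass_time_s(DATA_GB, DISK_GB_S))
    print("compute budget (s): ", compute_budget_s(DATA_GB, DISK_GB_S, DEADLINE_S))
    print("ram pass (s):       ", pass_time_s(DATA_GB, RAM_GB_S))
    print("max ram passes:     ", max_passes(DATA_GB, RAM_GB_S, DEADLINE_S))
    shard = shard_size_gb(DATA_GB, N_MACHINES)
    print("shard disk pass (s):", pass_time_s(shard, DISK_GB_S))
```

Plugging in your own measured bandwidths and the expected data set growth is the point of the exercise; the defaults here are just the round numbers used above.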
Basically, I am advocating focusing on the analysis needs, how they
scale/grow, and your near/medium/long term goals with this, before you
commit to a specific design/implementation. Avoid the "if all you have
is a hammer, every problem looks like a nail" view as much as possible.
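For the rough scaling study suggested in the first question, one simple option (an assumption on my part, not the only reasonable model) is to fit a power law t = a * size^b to response times measured on smaller data sets, then extrapolate to the full size. The sketch below uses only the standard library. The function names are hypothetical, and the numbers in the usage section are placeholders, not measurements.

```python
import math


def fit_power_law(sizes_gb, times_s):
    """Least-squares fit of t = a * size**b in log-log space; returns (a, b)."""
    xs = [math.log(s) for s in sizes_gb]
    ys = [math.log(t) for t in times_s]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # scaling exponent (slope in log-log)
    a = math.exp(mean_y - b * mean_x)   # prefactor (from the intercept)
    return a, b


def predict_time_s(a, b, size_gb):
    """Predicted response time for a data set of size_gb."""
    return a * size_gb ** b


if __name__ == "__main__":
    # PLACEHOLDER values, not real timings: substitute your own measurements.
    sizes_gb = [1.0, 5.0, 25.0]
    times_s = [2.0, 10.0, 50.0]

    a, b = fit_power_law(sizes_gb, times_s)
    predicted = predict_time_s(a, b, 250.0)
    print("exponent b:", b)
    print("predicted time at 250GB (s):", predicted)
    print("meets 600s target:", predicted <= 600.0)
```

An exponent near 1 suggests the code scales roughly linearly with data size; noticeably above 1 means growth in the data set will hurt more than proportionally. Treat any extrapolation with caution, since behavior can change abruptly once the working set no longer fits in RAM.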
e: joe.landman at gmail.com