[Beowulf] Large amounts of data to store and process
Lux, Jim (337K)
James.P.Lux at jpl.nasa.gov
Tue Mar 5 14:38:48 PST 2019
From: Beowulf [mailto:beowulf-bounces at beowulf.org] On Behalf Of Joe Landman
Sent: Tuesday, March 05, 2019 6:32 AM
To: beowulf at beowulf.org
Subject: Re: [Beowulf] Large amounts of data to store and process
On 3/4/19 8:00 PM, Lux, Jim (337K) via Beowulf wrote:
> I'm munging through not very much satellite telemetry (a few GByte), using sqlite3..
> Here's some general observations:
> 1) if the data is recorded by multiple sensor systems, the clocks will *not* align - sure they may run NTP, but....
> 2) Typically there's some sort of raw clock being recorded with the data (in ticks of some oscillator, typically) - that's what you can use to put data from a particular batch of sources into a time order. And then you have the problem of reconciling the different clocks.
> 3) Watch out for leap seconds in time stamps - some systems have them (UTC), some do not (GPS, TAI) - a time of 23:59:60 may be legal.
> 4) you need to have a way to deal with "missing" data, whether it's time tags, or actual measurements - as well as "gaps in the record"
> 5) Be aware of the need to de-dupe data - same telemetry records from multiple sources.
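As a rough illustration of points 4 and 5 in sqlite3: the table layout, column names and sample rows below are made up for the example, not taken from any real telemetry set. The assumption is that two records are "the same" when the raw tick count, channel and payload all match, which may or may not hold for a given system.

```python
import sqlite3

# Hypothetical schema: one row per telemetry record as received from one source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE telemetry (source TEXT, sc_ticks INTEGER, channel INTEGER, payload BLOB)"
)
rows = [
    ("station_a", 1000, 5, b"\x01\x02"),
    ("station_b", 1000, 5, b"\x01\x02"),  # same record seen by a second source
    ("station_a", 1010, 5, b"\x03\x04"),
    ("station_b", 1020, 5, None),         # record present, measurement missing
]
conn.executemany("INSERT INTO telemetry VALUES (?, ?, ?, ?)", rows)

# Collapse duplicates: group on the fields assumed to identify a record,
# and keep a count of how many sources delivered it.
deduped = conn.execute(
    """
    SELECT sc_ticks, channel, payload, COUNT(*) AS copies, MIN(source) AS first_source
    FROM telemetry
    GROUP BY sc_ticks, channel, payload
    ORDER BY sc_ticks
    """
).fetchall()

# Missing measurements stay as NULL (None in Python) rather than being dropped.
missing = [r for r in deduped if r[2] is None]
```

Keeping the copy count is a design choice: it is handy later for spotting records that only one source ever saw.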
Being satellite data, I am assuming you have relativistic corrections to the time, depending upon orbit, accuracy of the clock, data analysis needs, etc.
Missing data of various types can be handled in data frame packages. R, Julia, and I think Python can all handle this without too much pain.
No need to deal with relativity in these cases, but propagation delay, certainly.
Typically, the issue is reconciling multiple telemetry streams which have recording rates of <1000Hz, where the systems doing the recording do not have synchronized clocks.
A lot of spacecraft debugging is checking "did message A leave box B and arrive at box C", where Box B and Box C are timestamping the events. So you look and see if you have a Sending Message A on the Box B log and a Receiving Message A on the Box C log, in the right order, and with the right delay - if Box B is on the Moon and Box C is on Earth, then one expects a delay of roughly 1.3 seconds (the one-way light time).
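A minimal sketch of that check in Python. The log format (message id plus a timestamp in seconds), the function name and the tolerance are all invented for illustration, and it assumes the two logs have already been put on a common timescale, which is of course the hard part. The distance is the mean Earth-Moon distance, which gives a one-way light time of about 1.28 s.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
EARTH_MOON_KM = 384_400.0  # mean distance; the real value varies over the orbit

def check_delivery(sent, received, distance_km, tolerance_s=0.5):
    """Pair send/receive events by message id and classify each sent message."""
    expected = distance_km / SPEED_OF_LIGHT_KM_S  # one-way light time in seconds
    recv_by_id = dict(received)
    report = {}
    for msg_id, t_sent in sent:
        t_recv = recv_by_id.get(msg_id)
        if t_recv is None:
            report[msg_id] = "missing at receiver"
        elif t_recv < t_sent:
            report[msg_id] = "received before sent"
        elif abs((t_recv - t_sent) - expected) > tolerance_s:
            report[msg_id] = "unexpected delay"
        else:
            report[msg_id] = "ok"
    return report

# Hypothetical logs: (message id, timestamp in seconds)
box_b_sent = [("A", 100.0), ("B", 110.0), ("C", 120.0), ("D", 130.0)]
box_c_received = [("A", 101.3), ("C", 119.0), ("D", 135.0)]

report = check_delivery(box_b_sent, box_c_received, EARTH_MOON_KM)
```

With unsynchronized clocks, a "received before sent" or "unexpected delay" result is as likely to mean a clock offset as a real fault, so the tolerance has to reflect how well the clocks have been reconciled.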
One would like to do operations like "give me the telemetry from 12:01:00Z to 12:03:00Z" but there might not be a GPS valid time in that range, so you have to estimate it from the local clock, an estimate of the clock rate, and a known GPS time hack outside that range. The challenge comes in where the boxes have clocks that run at different rates - a 50 ppm error is about 4.3 seconds/day. Or multiple clocks (you might have a local clock and also a GPS, but the GPS doesn't always work).
There's probably a Python library to handle this, but I've not found it yet.
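For what it's worth, the arithmetic can be sketched in a few lines of Python. The function names, the 1000 Hz nominal rate and the sample numbers are assumptions for the example; it uses a simple linear clock model (constant rate between time hacks), which ignores temperature effects and aging.

```python
def drift_seconds_per_day(ppm):
    """Accumulated time error over one day for a fractional rate error given in ppm."""
    return ppm * 1e-6 * 86_400

def estimate_rate_hz(hack1, hack2):
    """Estimate the actual tick rate from two (ticks, utc_seconds) time hacks."""
    (ticks1, utc1), (ticks2, utc2) = hack1, hack2
    return (ticks2 - ticks1) / (utc2 - utc1)

def estimate_utc(ticks, hack, rate_hz):
    """Extrapolate UTC seconds for a raw tick count from one time hack and a rate.

    Works for ticks before or after the hack, since the difference may be negative.
    """
    hack_ticks, hack_utc = hack
    return hack_utc + (ticks - hack_ticks) / rate_hz

# Hypothetical time hacks: local counter value paired with a GPS-valid UTC second.
hack_early = (0, 1000.0)
hack_late = (1_000_050, 2000.0)          # nominal 1000 Hz clock running 50 ppm fast

rate = estimate_rate_hz(hack_early, hack_late)   # about 1000.05 Hz
t_est = estimate_utc(60_000, hack_early, rate)   # about 1059.997 s

# Select records whose estimated time falls in a requested UTC window.
records = [(30_000, "x"), (60_000, "y"), (90_000, "z")]   # (ticks, data)
window = [d for ticks, d in records
          if 1050.0 <= estimate_utc(ticks, hack_early, rate) <= 1070.0]
```

Using the nominal 1000 Hz instead of the estimated rate would put the 60,000-tick record at 1060.000 s rather than 1059.997 s; small here, but the error grows linearly with distance from the time hack.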