I have datasets from multiple instruments with differing, but hypothetically concurrent, datetime stamps. If data from instrument A do not correspond to any data from instrument B within some interval, I want to flag those data for removal later.
The current way I handle this is to loop over A and test against B with a list comprehension. However, my datasets can get very large -- O(400k) records -- which makes these nested loops extremely slow and inefficient.
How can I reformulate the following "findGaps" method to be more efficient?
from datetime import datetime, timedelta
import time
start = datetime(2025,4,25,0,0,0)
stop = datetime(2025,4,25,2,0,0)
delta = timedelta(seconds=1)
# Main dataset:
dateTimeMain = []
while start <= stop:
    start += delta
    dateTimeMain.append(start)
# Test dataset with gaps:
dateTimeTest = dateTimeMain.copy()
del dateTimeTest[30:300]
del dateTimeTest[3000:3300]
def findGaps(dTM,dTT):
    bTs = []
    start = -1
    i, index, stop = 0,0,0
    tThreshold = timedelta(seconds=30)
    for index, esTimeI in enumerate(dTM):
        tDiff = [abs(x - esTimeI) for x in dTT]
        if min(tDiff) > tThreshold:
            i += 1
            if start == -1:
                start = index
            stop = index
        else:
            if start != -1:
                startstop = [dTM[start],dTM[stop]]
                msg = f' Flag data from {startstop[0]} to {startstop[1]}'
                print(msg)
                bTs.append(startstop)
                start = -1
    if start != -1 and stop == index: # Records from a mid-point to the end are bad
        startstop = [dTM[start],dTM[stop]]
        bTs.append(startstop)
    return bTs
tic = time.process_time()
badTimes = findGaps(dateTimeMain,dateTimeTest)
print(f'Loops: {len(dateTimeMain)} x {len(dateTimeTest)} = {len(dateTimeMain)*len(dateTimeTest)}')
print(f'Uncertainty Update Elapsed Time: {time.process_time() - tic:.3f} s')
Returns:
Flag data from 2025-04-25 00:01:01 to 2025-04-25 00:04:30
Flag data from 2025-04-25 00:55:01 to 2025-04-25 00:59:00
Loops: 7201 x 6631 = 47749831
Uncertainty Update Elapsed Time: 5.167 s
A simple way to make this faster is to vectorise the code with Numpy:
import numpy as np

def findGaps_faster(dTM,dTT):
    bTs = []
    start = -1
    i, index, stop = 0,0,0
    tThreshold = timedelta(seconds=30)
    dTT = np.array(dTT, dtype=np.datetime64)
    for index, esTimeI in enumerate(dTM):
        esTimeI = np.datetime64(esTimeI)
        tDiff = np.abs(dTT - esTimeI)
        if np.min(tDiff) > tThreshold:
            i += 1
            if start == -1:
                start = index
            stop = index
        else:
            if start != -1:
                startstop = [dTM[start],dTM[stop]]
                msg = f' Flag data from {startstop[0]} to {startstop[1]}'
                print(msg)
                bTs.append(startstop)
                start = -1
    if start != -1 and stop == index: # Records from a mid-point to the end are bad
        startstop = [dTM[start],dTM[stop]]
        bTs.append(startstop)
    return bTs
This is about 20 times faster and provides the same result.
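For reference, the two versions can be compared directly on the question's data (a minimal sketch; badTimes_fast is just an illustrative name, and the 20x figure is machine-dependent):

tic = time.process_time()
badTimes_fast = findGaps_faster(dateTimeMain, dateTimeTest)
print(f'Vectorised Elapsed Time: {time.process_time() - tic:.3f} s')
# Same flagged intervals as the pure-Python version:
assert badTimes_fast == badTimes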
The current algorithm is inefficient: the O(n) computation for each item of dTM results in O(n m) time for the whole code. There is a way to make the algorithm significantly more efficient by sorting the data. Indeed, you can find the closest value (i.e. np.min(np.abs(dTT - esTimeI))) with a binary search running in O(log n) time, so the overall execution time goes down to O(m log n). This operation can also still be vectorised with Numpy thanks to np.searchsorted:
def findGaps_fastest(dTM,dTT):
    bTs = []
    start = -1
    i, index, stop = 0,0,0
    tThreshold = timedelta(seconds=30)
    # See below for faster conversions
    np_dTT = np.array(dTT, dtype=np.datetime64)
    np_dTT.sort()
    np_dTM = np.array(dTM, dtype=np.datetime64)
    pos = np.searchsorted(np_dTT, np_dTM, side='right')
    # Consider the 3 items close to the position found.
    # We can probably only consider 2 of them but this is simpler and less bug-prone.
    pos1 = np.maximum(pos-1, 0)
    pos2 = np.minimum(pos, np_dTT.size-1)
    pos3 = np.minimum(pos+1, np_dTT.size-1)
    tDiff1 = np.abs(np_dTT[pos1] - np_dTM)
    tDiff2 = np.abs(np_dTT[pos2] - np_dTM)
    tDiff3 = np.abs(np_dTT[pos3] - np_dTM)
    # np.minimum is binary (a third positional argument is used as the output array), so reduce pairwise
    tMin = np.minimum(np.minimum(tDiff1, tDiff2), tDiff3)
    for index in range(len(np_dTM)):
        if tMin[index] > tThreshold:
            i += 1
            if start == -1:
                start = index
            stop = index
        else:
            if start != -1:
                startstop = [dTM[start],dTM[stop]]
                msg = f' Flag data from {startstop[0]} to {startstop[1]}'
                print(msg)
                bTs.append(startstop)
                start = -1
    if start != -1 and stop == index: # Records from a mid-point to the end are bad
        startstop = [dTM[start],dTM[stop]]
        bTs.append(startstop)
    return bTs
This is about 90 times faster!
Note that most of the time is actually spent in the conversions from lists to Numpy arrays. Without that (i.e. assuming it can be pre-computed), it would be even faster (>200 times)! The rest of the time is mainly spent in the Python loop.
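To illustrate the pre-computation point, here is a sketch where the arrays are converted (and sorted) once, outside the timed section, so only the searchsorted-based test is measured. It uses just the two neighbouring candidates, which, as noted in the comment above, is probably enough; badMask is an illustrative name for a boolean flag array rather than the start/stop pairs returned above:

# Convert once, up front, and reuse the arrays afterwards
np_dTT = np.array(dateTimeTest, dtype=np.datetime64)
np_dTT.sort()
np_dTM = np.array(dateTimeMain, dtype=np.datetime64)

tic = time.process_time()
pos = np.searchsorted(np_dTT, np_dTM, side='right')
pos1 = np.maximum(pos - 1, 0)
pos2 = np.minimum(pos, np_dTT.size - 1)
tMin = np.minimum(np.abs(np_dTT[pos1] - np_dTM), np.abs(np_dTT[pos2] - np_dTM))
badMask = tMin > np.timedelta64(30, 's')  # True where no test sample lies within 30 s
print(f'Search-only Elapsed Time: {time.process_time() - tic:.3f} s')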
If this is not enough, then you can do the conversion in parallel with multiple threads. You can certainly do that in Cython, which can also make the remaining Python loop faster. In the end, most of the time should be spent reading pure-Python objects from the pure-Python list.
A convoluted way to make the conversion faster is to first extract each timestamp as a float and then build the Numpy array from those floats. This is about 6 times faster on my machine, and 90% of the time is spent in pure-Python datetime method calls. The precision of the resulting dates is on the order of a microsecond at best (a float64 timestamp cannot hold more). Here is an example:
timestamps = [e.timestamp() for e in dTT] # Expensive part
np_dTT = np.array(np.array(timestamps, np.float64), dtype='datetime64[s]')
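Inside findGaps_fastest, the same trick would replace both conversions (a sketch; note that datetime.timestamp() interprets naive datetimes as local time, so both lists should be converted the same way to keep them consistent):

# Drop-in replacement for the two conversions at the top of findGaps_fastest:
np_dTT = np.array(np.array([e.timestamp() for e in dTT], np.float64), dtype='datetime64[s]')
np_dTT.sort()
np_dTM = np.array(np.array([e.timestamp() for e in dTM], np.float64), dtype='datetime64[s]')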