Tags: python, performance, datetime, timedelta

How do you efficiently find gaps in one list of datetimes relative to another?


I have datasets from multiple instruments with differing, but hypothetically concurrent, datetime stamps. If data from instrument A does not correspond to any data from instrument B within some interval, I want to flag those data for removal later.

The current way I handle this is to loop over A and test each element against B with a list comprehension. However, my datasets can get very large -- O(400k) records -- which makes these nested loops extremely slow and inefficient.

How can I reformulate the following "findGaps" method to be more efficient?

from datetime import datetime, timedelta
import time

start = datetime(2025,4,25,0,0,0)
stop = datetime(2025,4,25,2,0,0)
delta = timedelta(seconds=1)

# Main dataset:
dateTimeMain = []
while start <= stop:
    start += delta
    dateTimeMain.append(start)

# Test dataset with gaps:
dateTimeTest = dateTimeMain.copy()
del dateTimeTest[30:300]
del dateTimeTest[3000:3300]

def findGaps(dTM,dTT):
    bTs = []
    start = -1
    i, index, stop = 0,0,0
    tThreshold = timedelta(seconds=30)

    for index, esTimeI in enumerate(dTM):

        tDiff = [abs(x - esTimeI) for x in dTT]
        if min(tDiff) > tThreshold:
            i += 1
            if start == -1:
                start = index
            stop = index
        else:
            if start != -1:
                startstop = [dTM[start],dTM[stop]]
                msg = f'   Flag data from {startstop[0]} to {startstop[1]}'
                print(msg)
                bTs.append(startstop)
                start = -1

    if start != -1 and stop == index: # Records from a mid-point to the end are bad
        startstop = [dTM[start],dTM[stop]]
        bTs.append(startstop)

    return bTs

tic = time.process_time()
badTimes = findGaps(dateTimeMain,dateTimeTest)
print(f'Loops: {len(dateTimeMain)} x {len(dateTimeTest)} = {len(dateTimeMain)*len(dateTimeTest)}')
print(f'Uncertainty Update Elapsed Time: {time.process_time() - tic:.3f} s')

Returns:

   Flag data from 2025-04-25 00:01:01 to 2025-04-25 00:04:30
   Flag data from 2025-04-25 00:55:01 to 2025-04-25 00:59:00
Loops: 7201 x 6631 = 47749831
Uncertainty Update Elapsed Time: 5.167 s

Solution

  • Faster code with Numpy

    A simple way to make this faster is to vectorise the code with Numpy:

    import numpy as np

    def findGaps_faster(dTM,dTT):
        bTs = []
        start = -1
        i, index, stop = 0,0,0
        tThreshold = timedelta(seconds=30)
    
        dTT = np.array(dTT, dtype=np.datetime64)
    
        for index, esTimeI in enumerate(dTM):
            esTimeI = np.datetime64(esTimeI)
            tDiff = np.abs(dTT - esTimeI)
            if np.min(tDiff) > tThreshold:
                i += 1
                if start == -1:
                    start = index
                stop = index
            else:
                if start != -1:
                    startstop = [dTM[start],dTM[stop]]
                    msg = f'   Flag data from {startstop[0]} to {startstop[1]}'
                    print(msg)
                    bTs.append(startstop)
                    start = -1
    
        if start != -1 and stop == index: # Records from a mid-point to the end are bad
            startstop = [dTM[start],dTM[stop]]
            bTs.append(startstop)
    
        return bTs
    

    This is about 20 times faster and provides the same result.
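
    For instance, reusing the timing harness and the variables from the question (dateTimeMain, dateTimeTest and badTimes), a quick sanity check could look as follows; badTimes_np is just an illustrative name, and the exact speed-up will depend on the machine:

    tic = time.process_time()
    badTimes_np = findGaps_faster(dateTimeMain, dateTimeTest)
    print(f'Numpy Elapsed Time: {time.process_time() - tic:.3f} s')
    assert badTimes_np == badTimes  # same flagged ranges as the original findGaps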


    More efficient Numpy algorithm

    The above algorithm is still inefficient: the O(n) scan of dTT for each item of dTM results in O(n m) time for the whole code. Sorting the data makes the algorithm significantly more efficient: the closest value (i.e. np.min(np.abs(dTT - esTimeI))) can then be found with a binary search running in O(log n) time, so the overall execution time drops to O(m log n). This operation can also still be vectorised with Numpy thanks to np.searchsorted:

    
    def findGaps_fastest(dTM,dTT):
        bTs = []
        start = -1
        i, index, stop = 0,0,0
        tThreshold = timedelta(seconds=30)
    
        # See below for faster conversions
        np_dTT = np.array(dTT, dtype=np.datetime64)
        np_dTT.sort()
    
        np_dTM = np.array(dTM, dtype=np.datetime64)
        pos = np.searchsorted(np_dTT, np_dTM, side='right')
    
        # Consider the 3 items closest to the position found.
        # Considering only 2 of them would probably suffice, but this is simpler and less bug-prone.
        pos1 = np.maximum(pos-1, 0)
        pos2 = np.minimum(pos, np_dTT.size-1)
        pos3 = np.minimum(pos+1, np_dTT.size-1)
        tDiff1 = np.abs(np_dTT[pos1] - np_dTM)
        tDiff2 = np.abs(np_dTT[pos2] - np_dTM)
        tDiff3 = np.abs(np_dTT[pos3] - np_dTM)
        tMin = np.minimum(np.minimum(tDiff1, tDiff2), tDiff3)
    
        for index in range(len(np_dTM)):
            if tMin[index] > tThreshold:
                i += 1
                if start == -1:
                    start = index
                stop = index
            else:
                if start != -1:
                    startstop = [dTM[start],dTM[stop]]
                    msg = f'   Flag data from {startstop[0]} to {startstop[1]}'
                    print(msg)
                    bTs.append(startstop)
                    start = -1
    
        if start != -1 and stop == index: # Records from a mid-point to the end are bad
            startstop = [dTM[start],dTM[stop]]
            bTs.append(startstop)
    
        return bTs
    

    This is about 90 times faster!


    Analysis and further optimisations

    Note that most of the time is actually spent in the conversions from lists to Numpy arrays. Without them (i.e. assuming the arrays can be pre-computed), it would be even faster (>200 times)! The rest of the time is mainly spent in the Python loop.

    If this is not enough, you can do the conversion in parallel with multiple threads, which you can certainly do in Cython. Cython can also make the Python loop faster. In the end, most of the time should be spent reading pure-Python objects from the pure-Python list.
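
    A Numpy-only alternative to moving to Cython is to combine both remarks above: pass the pre-converted arrays in as arguments and replace the Python loop with vectorised run detection on the boolean "gap" mask. The sketch below is only illustrative (findGaps_vectorised and its argument layout are not part of the code above); it drops the print statements and, as the comment in findGaps_fastest already hints, only considers the two bracketing neighbours:

    import numpy as np
    from datetime import timedelta

    def findGaps_vectorised(dTM, np_dTM, np_dTT_sorted, tThreshold=timedelta(seconds=30)):
        # np_dTM and np_dTT_sorted are pre-converted datetime64 arrays
        # (np_dTT_sorted must already be sorted); the original dTM list is
        # only used to return plain datetime objects, as in the versions above.
        pos = np.searchsorted(np_dTT_sorted, np_dTM, side='right')
        pos1 = np.maximum(pos - 1, 0)                   # nearest element at or below
        pos2 = np.minimum(pos, np_dTT_sorted.size - 1)  # nearest element above (clamped)
        tMin = np.minimum(np.abs(np_dTT_sorted[pos1] - np_dTM),
                          np.abs(np_dTT_sorted[pos2] - np_dTM))

        # Vectorised run detection: locate where the "bad" mask switches on/off
        bad = tMin > np.timedelta64(tThreshold)
        edges = np.diff(bad.astype(np.int8))
        starts = np.flatnonzero(edges == 1) + 1
        stops = np.flatnonzero(edges == -1)
        if bad.size and bad[0]:
            starts = np.concatenate(([0], starts))
        if bad.size and bad[-1]:
            stops = np.concatenate((stops, [bad.size - 1]))

        return [[dTM[s], dTM[e]] for s, e in zip(starts, stops)]

    # One-off conversions of the question's lists
    # (this is where most of the remaining time goes):
    np_dTT = np.array(dateTimeTest, dtype=np.datetime64)
    np_dTT.sort()
    np_dTM = np.array(dateTimeMain, dtype=np.datetime64)
    badTimes_vec = findGaps_vectorised(dateTimeMain, np_dTM, np_dTT)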

    A convoluted way to make the conversion faster is to first extract each timestamp as a float and then build the Numpy array directly from those floats. This is about 6 times faster on my machine, and 90% of that time is spent in pure-Python datetime method calls. The precision of the resulting dates is on the order of microseconds (no finer). Here is an example:

    timestamps = [e.timestamp() for e in dTT]  # Expensive part
    np_dTT = np.array(np.array(timestamps, np.float64), dtype='datetime64[s]')
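
    One caveat when plugging this into findGaps_fastest: datetime.timestamp() interprets naive datetimes as local time, while np.array(..., dtype=np.datetime64) keeps the naive wall-clock value, so both dTM and dTT must be converted the same way or the two arrays end up shifted by the local UTC offset. Also note that 'datetime64[s]' truncates to whole seconds, which is fine for the second-resolution data of the question. A small illustrative helper (to_datetime64 is not part of the code above) wrapping the conversion:

    import numpy as np

    def to_datetime64(dts):
        # Extract POSIX timestamps as floats, then build the array in one go
        timestamps = [e.timestamp() for e in dts]  # Expensive part
        return np.array(np.array(timestamps, np.float64), dtype='datetime64[s]')

    # Inside findGaps_fastest, the two conversions would then become:
    #     np_dTT = to_datetime64(dTT); np_dTT.sort()
    #     np_dTM = to_datetime64(dTM)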