dataframepython-polars

Join timestamps between 2 datasets while maintaining time difference


I have the following 2 datasets:

  1. This one is very reliable at telling me if an event happened, but it's timestamp is only good to within a few seconds (let's assume 2 seconds):
    coarse = {
        "name": ["a", "a", "b", "c", "a"],
        "timestamp": [100, 103, 195, 220, 221],
    }
    coarse_df = pl.DataFrame(coarse)
    """
    ┌──────┬───────────┐
    │ name ┆ timestamp │
    │ ---  ┆ ---       │
    │ str  ┆ i64       │
    ╞══════╪═══════════╡
    │ a    ┆ 100       │
    │ a    ┆ 103       │
    │ b    ┆ 195       │
    │ c    ┆ 220       │
    │ a    ┆ 221       │
    └──────┴───────────┘
    """
    
  1. This one has a very accurate timing, but it has some noise/false positives (note that t=0 is different for both datasets, there is an arbitrary offset):
    fine = {
        "name": ["a", "a", "a", "a", "b", "c", "b", "a"],
        "time": [0.05, 10.05, 12.51, 51.12, 106.0, 128.01, 130.0, 132.9],
    }
    fine_df = pl.DataFrame(fine)
    """
    ┌──────┬────────┐
    │ name ┆ time   │
    │ ---  ┆ ---    │
    │ str  ┆ f64    │
    ╞══════╪════════╡
    │ a    ┆ 0.05   │
    │ a    ┆ 10.05  │
    │ a    ┆ 12.51  │
    │ a    ┆ 51.12  │
    │ b    ┆ 106.0  │
    │ c    ┆ 128.01 │
    │ b    ┆ 130.0  │
    │ a    ┆ 132.9  │
    └──────┴────────┘
    """
    
    

I am trying to somehow join these data sets to obtain something like the following. Essentially getting the timestamps from the second dataset, and using the time differences in the first dataset to filter out the false-positives.

"""
┌──────┬────────┐
│ name ┆ time   │
│ ---  ┆ ---    │
│ str  ┆ f64    │
╞══════╪════════╡
│ a    ┆ 10.05  │
│ a    ┆ 12.51  │
│ b    ┆ 106.0  │
│ c    ┆ 128.01 │
│ a    ┆ 132.9  │
└──────┴────────┘
"""

EDIT

What I am currently doing to identify false positive (in words, because this is an ugly nested for loop):

Given that the offset between both datasets is arbitrary, assume that the first "a" event is the real one:

Now coarse looks like shifting time by 100:

┌──────┬───────────┐
│ name ┆ timestamp │
│ ---  ┆ ---       │
│ str  ┆ i64       │
╞══════╪═══════════╡
│ a    ┆ 0         │  -> Good, there is a timestamp in `fine` within 2s
│ a    ┆ 3         │  -> Bad, no timestamp in `fine` matches
│ b    ┆ 95        │  -> Bad, ditto
│ c    ┆ 120       │  -> Bad, ditto
│ a    ┆ 121       │  -> Bad, ditto
└──────┴───────────┘

Ok, I didn't find all hits, then the second "a" must be the real one (shifting by 90 seconds instead):

┌──────┬───────────┐
│ name ┆ timestamp │
│ ---  ┆ ---       │
│ str  ┆ i64       │
╞══════╪═══════════╡
│ a    ┆ 10        │  -> Good, it matches 10.05
│ a    ┆ 13        │  -> Good, it matches 12.51
│ b    ┆ 105       │  -> Good, it matches 106.0
│ c    ┆ 130       │  -> Good, it matches 128.01
│ a    ┆ 131       │  -> Good it matches 132.9
└──────┴───────────┘

Basically, I am sliding the second dataset through time until I find a "time pattern" that matches all events in the first dataframe into a subset of the second one.


Solution

  • The final table can be generated with an ASOF JOIN using the "nearest" strategy.

    (coarse_df
      .with_columns(pl.col.timestamp - 90.0)  
      .join_asof(fine_df, by="name", left_on="timestamp", right_on="time", strategy="nearest")
    )
    
    shape: (5, 3)
    ┌──────┬───────────┬────────┐
    │ name ┆ timestamp ┆ time   │
    │ ---  ┆ ---       ┆ ---    │
    │ str  ┆ f64       ┆ f64    │
    ╞══════╪═══════════╪════════╡
    │ a    ┆ 10.0      ┆ 10.05  │
    │ a    ┆ 13.0      ┆ 12.51  │
    │ b    ┆ 105.0     ┆ 106.0  │
    │ c    ┆ 130.0     ┆ 128.01 │
    │ a    ┆ 131.0     ┆ 132.9  │
    └──────┴───────────┴────────┘
    

    However, it sounds like you want to compare multiple "shift values" at once.

    (
       pl.concat(
          coarse_df.with_columns(pl.col.timestamp - float(shift), shift=shift)
          for shift in range(100, -10, -10)
       )
       .sort("timestamp") # asof requires sorted data
       .join_asof(fine_df, by="name", left_on="timestamp", right_on="time", strategy="nearest")
       .filter((pl.col.timestamp - pl.col.time).abs() <= 2) # within threshold
    )
    
    shape: (8, 4)
    ┌──────┬───────────┬───────┬────────┐
    │ name ┆ timestamp ┆ shift ┆ time   │
    │ ---  ┆ ---       ┆ ---   ┆ ---    │
    │ str  ┆ f64       ┆ i32   ┆ f64    │
    ╞══════╪═══════════╪═══════╪════════╡
    │ a    ┆ 0.0       ┆ 100   ┆ 0.05   │
    │ a    ┆ 10.0      ┆ 90    ┆ 10.05  │
    │ a    ┆ 13.0      ┆ 90    ┆ 12.51  │
    │ a    ┆ 50.0      ┆ 50    ┆ 51.12  │
    │ a    ┆ 53.0      ┆ 50    ┆ 51.12  │
    │ b    ┆ 105.0     ┆ 90    ┆ 106.0  │
    │ c    ┆ 130.0     ┆ 90    ┆ 128.01 │
    │ a    ┆ 131.0     ┆ 90    ┆ 132.9  │
    └──────┴───────────┴───────┴────────┘
    

    And find the shift(s) with the most values.

       ...
       .with_columns(pl.len().over("shift"))
       .filter(pl.col.len == pl.col.len.max())
    )   
    
    shape: (5, 5)
    ┌──────┬───────────┬───────┬────────┬─────┐
    │ name ┆ timestamp ┆ shift ┆ time   ┆ len │
    │ ---  ┆ ---       ┆ ---   ┆ ---    ┆ --- │
    │ str  ┆ f64       ┆ i32   ┆ f64    ┆ u32 │
    ╞══════╪═══════════╪═══════╪════════╪═════╡
    │ a    ┆ 10.0      ┆ 90    ┆ 10.05  ┆ 5   │
    │ a    ┆ 13.0      ┆ 90    ┆ 12.51  ┆ 5   │
    │ b    ┆ 105.0     ┆ 90    ┆ 106.0  ┆ 5   │
    │ c    ┆ 130.0     ┆ 90    ┆ 128.01 ┆ 5   │
    │ a    ┆ 131.0     ┆ 90    ┆ 132.9  ┆ 5   │
    └──────┴───────────┴───────┴────────┴─────┘