I have a list of lists of two elements with type pd.Timestamp
, e.g.:
intervals = [
[pd.Timestamp('2023-01-01 02:00:00'), pd.Timestamp('2023-01-01 03:00:00')],
[pd.Timestamp('2023-01-01 05:00:00'), pd.Timestamp('2023-01-01 07:00:00')],
[pd.Timestamp('2023-01-01 07:30:00'), pd.Timestamp('2023-01-01 08:00:00')],
]
I also have a sample dataframe, e.g.:
data = {'value': [1, 2, 3, 4]}
index = [
pd.Timestamp('2023-01-01 01:00:00'),
pd.Timestamp('2023-01-01 02:00:00'),
pd.Timestamp('2023-01-01 03:00:00'),
pd.Timestamp('2023-01-01 04:00:00'),
]
df = pd.DataFrame(data, index=index)
Now, how can I check if the index of the dataframe is inside all the datetime intervals? In other words, for every element of the index of the dataframe (a row) check if that row is contained in the first interval, or in the second interval, or in the third intervals, and so on by doing something like:
for start, end in intervals:
df.index >= start & df.index >= end
My idea was to:
masks = [
[False, True, True, False],
[False, False, False, False],
[False, False, False, False],
]
mask = [False, True, True, False]
I wish to do 1 and 2 in a vectorized manner, so I would like to use numpy or pandas when possible. I am managing the data structure in Python with lists though at the moment. Which is the fastest way and which is the most elegant one also?
You can use an IntervalIndex
with get_indexer_for
to identify the existing matches, then all
to aggregate:
idx = pd.IntervalIndex.from_tuples(list(map(tuple, intervals)), closed='both')
# or
# idx = pd.IntervalIndex.from_arrays(*zip(*intervals), closed='both')
out = (idx.get_indexer_for(df.index) != -1).all()
Output: False
How it works:
idx.get_indexer_for(df.index)
# array([-1, 0, 0, -1])
# check if values are found
idx.get_indexer_for(df.index) != -1
# array([False, True, True, False])
# are all values found?
(idx.get_indexer_for(df.index) != -1).all()
# False
which = idx.get_indexer_for(df.index) != -1
Output: array([False, True, True, False])
If you need to compare all combinations, use numpy:
i = df.index.to_numpy()[:,None]
idx = pd.IntervalIndex.from_arrays(*zip(*intervals), closed='both')
out = (i >= idx.left) & (i <= idx.right)
Output:
array([[False, False, False, False],
[ True, False, False, False],
[ True, False, False, False],
[False, False, False, False]])