How can I find duplicates of a group of rows inside of a DataFrame? Or in other words, how can I find the indices of a specific duplicated DataFrame inside of a larger DataFrame?
The larger DataFrame:
index | 0 | 1 |
---|---|---|
0 | 0 | 1 |
1 | 2 | 3 |
2 | 4 | 4 |
3 | 0 | 1 |
4 | 2 | 3 |
5 | 2 | 3 |
6 | 0 | 1 |
The specific duplicated DataFrame (or group of rows):
index | 0 | 1 |
---|---|---|
0 | 0 | 1 |
1 | 2 | 3 |
Indices I am looking for:
index |
---|
0 |
1 |
3 |
4 |
(Note that the indices of the duplicated DataFrame do not matter, only the values).
import pandas as pd
# larger DataFrame
lrg_df = pd.DataFrame([[0, 1], [2, 3], [4, 4], [0, 1], [2, 3], [2, 3], [0, 1]])
# group of rows (i.e., duplicated DataFrame)
dup_df = pd.DataFrame([[0, 1], [2, 3]])
# get indices of lrg_df that contain dup_df
indcs = lrg_df[lrg_df == dup_df].index # Doesn't work of course
You need to check all combinations with a sliding window, using numpy.lib.stride_tricks.sliding_window_view
to create a mask and extend the mask with numpy.convolve
:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as swv
n = len(dup_df)
mask = (swv(lrg_df, n, axis=0)
== dup_df.to_numpy().T
).all((1,2))
out = lrg_df[np.convolve(mask, np.ones(n))>0]
Output:
0 1
0 0 1
1 2 3
3 0 1
4 2 3
And if you want the indices:
indices = lrg_df.index[np.convolve(mask, np.ones(n))>0]
Output:
Index([0, 1, 3, 4], dtype='int64')
Intermediates:
# swv(lrg_df, n, axis=0) == dup_df.to_numpy().T
array([[[ True, True],
[ True, True]],
[[False, False],
[False, False]],
[[False, False],
[False, False]],
[[ True, True],
[ True, True]],
[[False, True],
[False, True]],
[[False, False],
[False, False]]])
# mask
array([ True, False, False, True, False, False])