[SOLVED] Find duplicate "group of rows" in pandas DataFrame

How can I find duplicates of a group of rows inside of a DataFrame? Or in other words, how can I find the indices of a specific duplicated DataFrame inside of a larger DataFrame?

The larger DataFrame:

index	0	1
0	0	1
1	2	3
2	4	4
3	0	1
4	2	3
5	2	3
6	0	1

The specific duplicated DataFrame (or group of rows):

index	0	1
0	0	1
1	2	3

Indices I am looking for:

index
0
1
3
4

(Note that the indices of the duplicated DataFrame do not matter, only the values).

import pandas as pd

# larger DataFrame
lrg_df = pd.DataFrame([[0, 1], [2, 3], [4, 4], [0, 1], [2, 3], [2, 3], [0, 1]])

# group of rows (i.e., duplicated DataFrame)
dup_df = pd.DataFrame([[0, 1], [2, 3]])

# get indices of lrg_df that contain dup_df
indcs = lrg_df[lrg_df == dup_df].index  # Doesn't work: the two frames have different shapes

Solution

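Compare every window of len(dup_df) consecutive rows of lrg_df against dup_df. numpy.lib.stride_tricks.sliding_window_view (NumPy 1.20 or later) builds the windows, and the comparison yields a mask that is True at the first row of each match. numpy.convolve then extends each True to the remaining rows of its window: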

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as swv

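n = len(dup_df)

# The window axis comes last, hence the transpose of dup_df.
# mask[i] is True when rows i..i+n-1 of lrg_df equal dup_df.
mask = (swv(lrg_df, n, axis=0)
        == dup_df.to_numpy().T
       ).all((1,2))

# Spread each match start over all n rows of its window
out = lrg_df[np.convolve(mask, np.ones(n))>0]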

Output:

   0  1
0  0  1
1  2  3
3  0  1
4  2  3

And if you want the indices:

indices = lrg_df.index[np.convolve(mask, np.ones(n))>0]

Output:

Index([0, 1, 3, 4], dtype='int64')
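
For reuse, the steps above could be wrapped in a small helper. This is only a sketch: the name find_block_indices and the guards for an empty, oversized, or differently shaped dup_df are assumptions added here, not part of the original answer.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def find_block_indices(lrg_df, dup_df):
    """Return index labels of lrg_df rows covered by a consecutive match of dup_df.

    Hypothetical helper: it wraps the sliding-window comparison shown above.
    """
    if dup_df.shape[1] != lrg_df.shape[1]:
        raise ValueError("both frames must have the same number of columns")
    n = len(dup_df)
    # Assumed guard: an empty block, or one longer than lrg_df, matches nothing
    if n == 0 or n > len(lrg_df):
        return lrg_df.index[:0]
    # windows[i, c, w] == lrg_df.iloc[i + w, c]; the window axis comes last
    windows = sliding_window_view(lrg_df.to_numpy(), n, axis=0)
    # starts[i] is True when rows i..i+n-1 equal dup_df, row for row
    starts = (windows == dup_df.to_numpy().T).all(axis=(1, 2))
    # "full" convolution returns len(starts) + n - 1 == len(lrg_df) values
    covered = np.convolve(starts.astype(int), np.ones(n, dtype=int)) > 0
    return lrg_df.index[covered]


lrg_df = pd.DataFrame([[0, 1], [2, 3], [4, 4], [0, 1], [2, 3], [2, 3], [0, 1]])
dup_df = pd.DataFrame([[0, 1], [2, 3]])
print(find_block_indices(lrg_df, dup_df).tolist())  # [0, 1, 3, 4]
```

Because the result is taken from lrg_df.index, the helper returns labels rather than positions, so it should also work when lrg_df does not use a default integer index.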

Intermediates:

# swv(lrg_df, n, axis=0) == dup_df.to_numpy().T
array([[[ True,  True],
        [ True,  True]],

       [[False, False],
        [False, False]],

       [[False, False],
        [False, False]],

       [[ True,  True],
        [ True,  True]],

       [[False,  True],
        [False,  True]],

       [[False, False],
        [False, False]]])

# mask
array([ True, False, False,  True, False, False])
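
The intermediates stop at the mask, so here is a small sketch of the remaining numpy.convolve step in isolation. The mask values are copied from the example; the variable name counts is only illustrative.

```python
import numpy as np

# Window-start mask from the example: the windows starting at rows 0 and 3 match
mask = np.array([True, False, False, True, False, False])
n = 2  # number of rows in dup_df

# The default mode="full" yields len(mask) + n - 1 = 7 values, one per row of lrg_df
counts = np.convolve(mask, np.ones(n))
print(counts)      # [1. 1. 0. 1. 1. 0. 0.]
print(counts > 0)  # [ True  True False  True  True False False]
```

Each entry of counts is the number of matching windows that cover that row, so counts > 0 selects rows 0, 1, 3 and 4.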