pythonpandasdataframeindexingduplicates

Find duplicate "group of rows" in pandas DataFrame


How can I find duplicates of a group of rows inside of a DataFrame? Or in other words, how can I find the indices of a specific duplicated DataFrame inside of a larger DataFrame?

The larger DataFrame:

index 0 1
0 0 1
1 2 3
2 4 4
3 0 1
4 2 3
5 2 3
6 0 1

The specific duplicated DataFrame (or group of rows):

index 0 1
0 0 1
1 2 3

Indices I am looking for:

index
0
1
3
4

(Note that the indices of the duplicated DataFrame do not matter, only the values).

import pandas as pd

# larger DataFrame
lrg_df = pd.DataFrame([[0, 1], [2, 3], [4, 4], [0, 1], [2, 3], [2, 3], [0, 1]])

# group of rows (i.e., duplicated DataFrame)
dup_df = pd.DataFrame([[0, 1], [2, 3]])

# get indices of lrg_df that contain dup_df
indcs = lrg_df[lrg_df == dup_df].index  # Doesn't work of course

Solution

  • You need to check all combinations with a sliding window, using numpy.lib.stride_tricks.sliding_window_view to create a mask and extend the mask with numpy.convolve:

    import numpy as np
    from numpy.lib.stride_tricks import sliding_window_view as swv
    
    n = len(dup_df)
    mask = (swv(lrg_df, n, axis=0)
            == dup_df.to_numpy().T
           ).all((1,2))
    
    out = lrg_df[np.convolve(mask, np.ones(n))>0]
    

    Output:

       0  1
    0  0  1
    1  2  3
    3  0  1
    4  2  3
    

    And if you want the indices:

    indices = lrg_df.index[np.convolve(mask, np.ones(n))>0]
    

    Output:

    Index([0, 1, 3, 4], dtype='int64')
    

    Intermediates:

    # swv(lrg_df, n, axis=0) == dup_df.to_numpy().T
    array([[[ True,  True],
            [ True,  True]],
    
           [[False, False],
            [False, False]],
    
           [[False, False],
            [False, False]],
    
           [[ True,  True],
            [ True,  True]],
    
           [[False,  True],
            [False,  True]],
    
           [[False, False],
            [False, False]]])
    
    # mask
    array([ True, False, False,  True, False, False])