pythonpandas

Check for duplicates in a column, while excluding a sub-string


I am looking for an optimized way of checking for any duplicates in a Panda dataframe column, but excluding a given position in every element of that column.

In the example there is a duplication in 'id1_ver1_ready' if we exclude the version number ('id1_ver1_ready' <-> 'id1_ver3_ready'). Same for ( 'id5_ver1_unknown' <-> 'id5_ver6_unknown')

from numpy import nan
df = pd.DataFrame({'ID': ['id1_ver1_ready', 'id2_ver1_unknown', 'id3_ver1_processed', 'id1_ver3_ready', 'id4_ver1_ready', 'id5_ver1_unknown', 'id5_ver6_unknown', 'id6_ver1_processed']})enter code here

Solution

  • Another possible solution, which filters df by removing duplicate entries based on the first and third parts of the ID column:

    df[~df['ID'].str.split('_', expand=True).duplicated([0,2])]
    

    Output:

                       ID
    0      id1_ver1_ready
    1    id2_ver1_unknown
    2  id3_ver1_processed
    4      id4_ver1_ready
    5    id5_ver1_unknown
    7  id6_ver1_processed