I am looking for an efficient way to check for duplicates in a pandas DataFrame column while ignoring a given segment of every value in that column.
In the example, 'id1_ver1_ready' is a duplicate of 'id1_ver3_ready' once the version number is excluded. The same applies to 'id5_ver1_unknown' <-> 'id5_ver6_unknown'.
import pandas as pd
from numpy import nan

df = pd.DataFrame({'ID': ['id1_ver1_ready', 'id2_ver1_unknown', 'id3_ver1_processed', 'id1_ver3_ready', 'id4_ver1_ready', 'id5_ver1_unknown', 'id5_ver6_unknown', 'id6_ver1_processed']})
Another possible solution filters df by removing duplicate entries based on the first and third parts of the ID column.

str.split('_', expand=True) splits the ID strings at the underscores into separate columns, producing a DataFrame in which each part of the ID sits in its own column. duplicated([0, 2]) then flags rows that repeat an already-seen combination of values in the first (index 0) and third (index 2) columns, marking each later occurrence as True. The tilde operator (~) negates this boolean mask, so the whole expression keeps only the rows of df with a unique combination of values in the first and third parts of the ID.
df[~df['ID'].str.split('_', expand=True).duplicated([0, 2])]
Output:
ID
0 id1_ver1_ready
1 id2_ver1_unknown
2 id3_ver1_processed
4 id4_ver1_ready
5 id5_ver1_unknown
7 id6_ver1_processed
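If the part to exclude always matches a fixed pattern (here, a "_verN" segment), a sketch of an alternative is to strip that segment with a regular expression and deduplicate on the stripped key. The pattern r'_ver\d+' below is an assumption based on the sample data; adjust it to whatever the excluded segment actually looks like.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['id1_ver1_ready', 'id2_ver1_unknown', 'id3_ver1_processed',
                          'id1_ver3_ready', 'id4_ver1_ready', 'id5_ver1_unknown',
                          'id5_ver6_unknown', 'id6_ver1_processed']})

# Remove the version segment, then drop rows whose stripped key was already seen.
key = df['ID'].str.replace(r'_ver\d+', '', regex=True)
result = df[~key.duplicated()]
print(result)
```

This gives the same six rows as the split-based approach on this data; which one is faster will depend on the data, so it is worth timing both on the real DataFrame.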