python-3.xpandasdataframeduplicatespartial-matches

How to isolate duplicates based on partial match in a pandas dataframe


I have a pandas data frame, which looks like the following:

email                   col2  col3
email@example.com       John  Doe
xxxemail@example.com    John  Doe
xxemail@example.com     John  Doe
xxxxxemail@example.com  John  Doe
xxxemail@example2.com   Jane  Doe

I want to go through each email address starting with at least two 'x's and check whether the same email address exists without those 'x's.

Required result:

email                   col2  col3  exists_in_valid_form
email@example.com       John  Doe   False
xxxemail@example.com    John  Doe   True
xxemail@example.com     John  Doe   True
xxxxxemail@example.com  John  Doe   True
xxxemail@example2.com   Jane  Doe   False

I was able to get a sub-data frame containing all of those rows with the emails starting with 'xx' using df[df['email'].str.contains("xx")], and was also able to get the email addresses without the 'x's using str.lstrip('x'), but neither does not seem to help me get whether this email appears somewhere else without those x's or not.


Solution

  • You can use duplicated() to get whether a value is existing in other row.

    df['exists_in_valid_form'] = df.email.str.lstrip('x').duplicated(keep=False) & df.email.str.startswith('xx')
    

    I added df.email.str.startswith('xx') to make sure it should start with at least 2 "x" and return False for "xemail@example.com".