I have a pandas data frame, which looks like the following:
email col2 col3
email@example.com John Doe
xxxemail@example.com John Doe
xxemail@example.com John Doe
xxxxxemail@example.com John Doe
xxxemail@example2.com Jane Doe
I want to go through each email address starting with at least two 'x's and check whether the same email address exists without those 'x's.
Required result:
email col2 col3 exists_in_valid_form
email@example.com John Doe False
xxxemail@example.com John Doe True
xxemail@example.com John Doe True
xxxxxemail@example.com John Doe True
xxxemail@example2.com Jane Doe False
I was able to get a sub-data frame containing all of those rows with the emails starting with 'xx' using df[df['email'].str.contains("xx")]
, and was also able to get the email addresses without the 'x's using str.lstrip('x')
, but neither does not seem to help me get whether this email appears somewhere else without those x's or not.
You can use duplicated()
to get whether a value is existing in other row.
df['exists_in_valid_form'] = df.email.str.lstrip('x').duplicated(keep=False) & df.email.str.startswith('xx')
I added df.email.str.startswith('xx')
to make sure it should start with at least 2 "x" and return False for "xemail@example.com".