I'm trying to remove rows based on their string value. To do so, I'm looking for specific substrings within the strings. My issue comes from the fact that those substrings contain spaces, and it seems my current filter does not take the space into consideration.
# List of substring to remove
searchfor = ['Value ', 'VALUE ', 'Value.', 'VALUE.']
# keeping only the rows with the substrings
df[df["column"].str.contains('|'.join(searchfor))]
Yet, I still retrieve some rows containing values such as: "XXVALUEXXXXXX"
I believed the space would have excluded those, but apparently not.
There's no issue with the space. The problem is that a dot (.) is a special character in regex, used to match any character. Escape the character with a backslash if you want a literal match:
searchfor = ['Value ', 'VALUE ', r'Value\.', r'VALUE\.']  # dots escaped (raw strings avoid invalid-escape warnings)
E.g.:
import pandas as pd
df = pd.DataFrame({'column': ['XXVALUEXXXXXX',
'XXVALUE.XXXXXX',
'XXValue XXXXXX']})
df
column
0 XXVALUEXXXXXX # will no longer be matched
1 XXVALUE.XXXXXX
2 XXValue XXXXXX
searchfor = ['Value ', 'VALUE ', r'Value\.', r'VALUE\.']
df[df['column'].str.contains('|'.join(searchfor))]
column
1 XXVALUE.XXXXXX
2 XXValue XXXXXX
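To see the dot's special behaviour in isolation, here is a minimal check with the standard-library re module (the strings mirror the example data above):

```python
import re

# Unescaped '.' matches any character, so 'VALUE.' hits 'VALUEX'
print(re.search('VALUE.', 'XXVALUEXXXXXX'))    # a match object (truthy)

# Escaped r'VALUE\.' only matches a literal dot, so this fails
print(re.search(r'VALUE\.', 'XXVALUEXXXXXX'))  # None
```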
Or map re.escape over each item in your list (thanks to @mozway in the comments):
import re
searchfor = ['Value ', 'VALUE ', 'Value.', 'VALUE.']
escaped_searchfor = '|'.join(map(re.escape, searchfor))
df[df['column'].str.contains(escaped_searchfor)]
column
1 XXVALUE.XXXXXX
2 XXValue XXXXXX
Or using a list comprehension:
escaped_searchfor = '|'.join([re.escape(elem) for elem in searchfor])
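Since the stated goal was to remove the matching rows rather than keep them, negating the boolean mask with ~ does that. A minimal sketch reusing the example data from above:

```python
import re
import pandas as pd

df = pd.DataFrame({'column': ['XXVALUEXXXXXX',
                              'XXVALUE.XXXXXX',
                              'XXValue XXXXXX']})

searchfor = ['Value ', 'VALUE ', 'Value.', 'VALUE.']
pattern = '|'.join(map(re.escape, searchfor))

# ~ inverts the mask: keep only rows that do NOT contain any substring
result = df[~df['column'].str.contains(pattern)]
print(result)  # only row 0, 'XXVALUEXXXXXX', survives
```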