Tags: python, pandas, substring, space

Check for a substring containing a space in a string in a dataframe


I'm trying to remove rows from a dataframe based on their string value. To do so, I'm looking for specific substrings within the strings. My issue comes from the fact that those substrings contain spaces, and it seems my current filter does not take the space into consideration.

# List of substrings to remove
searchfor = ['Value ', 'VALUE ', 'Value.', 'VALUE.']
# keeping only the rows with the substrings
df[df["column"].str.contains('|'.join(searchfor))]

Yet, I still retrieve some rows containing values such as: "XXVALUEXXXXXX"

I believe the space should have excluded those, but apparently not.


Solution

  • There's no issue with the space. The problem is that a dot (.) is a special character in regex, used to match any character. Escape the character with \ if you want a literal match (raw strings keep Python from interpreting the backslash):

    searchfor = ['Value ', 'VALUE ', r'Value\.', r'VALUE\.'] # dots escaped
    

    E.g.:

    import pandas as pd
    
    df = pd.DataFrame({'column': ['XXVALUEXXXXXX',
                                  'XXVALUE.XXXXXX',
                                  'XXValue XXXXXX']})
    
    df
    
               column
    0   XXVALUEXXXXXX # will no longer be matched
    1  XXVALUE.XXXXXX
    2  XXValue XXXXXX
    
    searchfor = ['Value ', 'VALUE ', r'Value\.', r'VALUE\.']
    
    df[df['column'].str.contains('|'.join(searchfor))]
    
               column
    1  XXVALUE.XXXXXX
    2  XXValue XXXXXX
    

    Or map re.escape to each item in your list (thanks to @mozway in the comments):

    import re
    
    searchfor = ['Value ', 'VALUE ', 'Value.', 'VALUE.']
    
    escaped_searchfor = '|'.join(map(re.escape, searchfor))
    
    df[df['column'].str.contains(escaped_searchfor)]
    
               column
    1  XXVALUE.XXXXXX
    2  XXValue XXXXXX
    

    Or using a list comprehension:

    escaped_searchfor = '|'.join([re.escape(elem) for elem in searchfor])
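
    Since the original goal was to remove the matching rows rather than keep them, a minimal sketch of that last step (assuming the same example DataFrame as above) would simply negate the boolean mask with ~:

    import re
    import pandas as pd

    df = pd.DataFrame({'column': ['XXVALUEXXXXXX',
                                  'XXVALUE.XXXXXX',
                                  'XXValue XXXXXX']})

    searchfor = ['Value ', 'VALUE ', 'Value.', 'VALUE.']
    pattern = '|'.join(map(re.escape, searchfor))

    # ~ inverts the mask, so rows containing any of the substrings are dropped
    df_cleaned = df[~df['column'].str.contains(pattern)]

    df_cleaned

              column
    0  XXVALUEXXXXXX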