Tags: python, pandas, substring, space

Check for a substring containing a space in a string in a dataframe


I'm trying to remove rows from a dataframe based on their string value. To do so, I'm looking for specific substrings within the strings. My issue comes from the fact that those substrings contain spaces, and it seems my current filter does not take the space into consideration.

# List of substrings to remove
searchfor = ['Value ', 'VALUE ', 'Value.', 'VALUE.']
# keeping only the rows with the substrings
df[df["column"].str.contains('|'.join(searchfor))]

Yet, I still retrieve some rows containing values such as: "XXVALUEXXXXXX"

I believe the space should have excluded those, but apparently not.


Solution

  • There's no issue with the space. The problem is that a dot (.) is a special character in regex, used to match any character. Escape the character with \ if you want a literal match (raw strings keep Python from interpreting the backslash):

    searchfor = ['Value ', 'VALUE ', r'Value\.', r'VALUE\.'] # dots escaped
    

    E.g.:

    import pandas as pd
    
    df = pd.DataFrame({'column': ['XXVALUEXXXXXX',
                                  'XXVALUE.XXXXXX',
                                  'XXValue XXXXXX']})
    
    df
    
               column
    0   XXVALUEXXXXXX # will no longer be matched
    1  XXVALUE.XXXXXX
    2  XXValue XXXXXX
    
    searchfor = ['Value ', 'VALUE ', r'Value\.', r'VALUE\.']
    
    df[df['column'].str.contains('|'.join(searchfor))]
    
               column
    1  XXVALUE.XXXXXX
    2  XXValue XXXXXX
    

    Or map re.escape to each item in your list (thanks to @mozway in the comments):

    import re
    
    searchfor = ['Value ', 'VALUE ', 'Value.', 'VALUE.']
    
    escaped_searchfor = '|'.join(map(re.escape, searchfor))
    
    df[df['column'].str.contains(escaped_searchfor)]
    
               column
    1  XXVALUE.XXXXXX
    2  XXValue XXXXXX
    

    Or using a list comprehension:

    escaped_searchfor = '|'.join([re.escape(elem) for elem in searchfor])
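
    Since the original goal was to remove the matching rows rather than keep them, a minimal sketch of that last step (assuming the same example DataFrame as above) would simply negate the boolean mask with ~:

    import re
    import pandas as pd

    df = pd.DataFrame({'column': ['XXVALUEXXXXXX',
                                  'XXVALUE.XXXXXX',
                                  'XXValue XXXXXX']})

    searchfor = ['Value ', 'VALUE ', 'Value.', 'VALUE.']
    pattern = '|'.join(map(re.escape, searchfor))

    # ~ inverts the mask, so rows containing any of the substrings are dropped
    df_cleaned = df[~df['column'].str.contains(pattern)]

    df_cleaned

              column
    0  XXVALUEXXXXXX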