python pandas

Unexpected results when using str.contains() in Pandas


I have a Pandas DataFrame with a column containing different strings. I'm trying to find all rows where a specific string, "HELLO + WORLD", appears. However, when I use str.contains(), it only returns True for the first few rows. Here's my sample code:

import pandas as pd

df = pd.DataFrame({
    'AREA': ["HELLO / WORLD"] * 3 + ["HELLO + WORLD"] * 200
})

print(df['AREA'].str.contains("HELLO + WORLD"))

print(df['AREA'].str.contains("HELLO / WORLD"))

OUTPUT:

0      False
1      False
2      False
3      False
4      False
       ...  
198    False
199    False
200    False
201    False
202    False
Name: AREA, Length: 203, dtype: bool

0       True
1       True
2       True
3      False
4      False
       ...  
198    False
199    False
200    False
201    False
202    False
Name: AREA, Length: 203, dtype: bool

I expected to get True for all rows containing the correct substrings, but the output is mostly False. Can someone explain why this is happening and suggest a solution?


Solution

  • By default, pandas.Series.str.contains takes a regex pattern, not a literal string.

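    So "HELLO + WORLD" is read as the string "HELLO", followed by one or more space characters (" +"), followed by " WORLD". That matches text like "HELLO  WORLD" (two or more spaces) but never a literal "+", which is why every row comes back False.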

    To get your expected result, you either need to use a regex pattern that escapes the + with a \ so it's interpreted as the plus character, or set regex=False:

    # raw string, so Python does not treat "\+" as a string escape sequence
    df['AREA'].str.contains(r"HELLO \+ WORLD")
    # or
    df['AREA'].str.contains("HELLO + WORLD", regex=False)
    

    Both will output:

    0      False
    1      False
    2      False
    3       True
    4       True
           ...  
    198     True
    199     True
    200     True
    201     True
    202     True
    Name: AREA, Length: 203, dtype: bool
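
To see the regex behaviour described above in isolation, here is a minimal sketch. It uses the standard-library re module directly and rebuilds the DataFrame from the question. The variable names (unescaped, needle, mask, and so on) are only illustrative. It also shows re.escape, which is one more option, not mentioned above, that can help when the search text comes from a variable and you still want regex matching.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'AREA': ["HELLO / WORLD"] * 3 + ["HELLO + WORLD"] * 200
})

# As a regex, " +" means "one or more spaces", so the unescaped pattern
# needs at least two spaces between the words and never matches a "+".
unescaped = "HELLO + WORLD"
matches_spaces = re.search(unescaped, "HELLO  WORLD") is not None
matches_plus = re.search(unescaped, "HELLO + WORLD") is not None
print(matches_spaces)  # True
print(matches_plus)    # False

# re.escape() backslash-escapes regex metacharacters in the search text,
# so the result can be passed to str.contains() as a literal pattern.
needle = "HELLO + WORLD"
mask = df['AREA'].str.contains(re.escape(needle))
n_matches = int(mask.sum())
print(n_matches)  # 200
```

If you do not need any regex features, regex=False is probably the simpler choice, since it avoids escaping altogether.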