[SOLVED] Unexpected results when using str.contains() in Pandas

I have a Pandas DataFrame with a column containing different strings. I'm trying to find all rows where a specific string, "HELLO + WORLD", appears. However, str.contains() returns False for every row when I search for that string, even though searching for "HELLO / WORLD" correctly returns True for the first three rows. Here's my sample code:

import pandas as pd

df = pd.DataFrame({
    'AREA': ["HELLO / WORLD"] * 3 + ["HELLO + WORLD"] * 200
})

print(df['AREA'].str.contains("HELLO + WORLD"))

print(df['AREA'].str.contains("HELLO / WORLD"))

OUTPUT:

0      False
1      False
2      False
3      False
4      False
       ...  
198    False
199    False
200    False
201    False
202    False
Name: AREA, Length: 203, dtype: bool

0       True
1       True
2       True
3      False
4      False
       ...  
198    False
199    False
200    False
201    False
202    False
Name: AREA, Length: 203, dtype: bool

I expected to get True for all rows containing the correct substrings, but the output is mostly False. Can someone explain why this is happening and suggest a solution?

Solution

By default, pandas.Series.str.contains takes a regex pattern, not a literal string.

So "HELLO + WORLD" will try to match the string "HELLO" followed by one or more space characters (" +"), followed by " WORLD". The rows that contain a literal "+" between the words therefore never match.
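
To make that interpretation concrete, here is a minimal sketch using Python's built-in re module directly. It assumes the pattern follows Python's re semantics; the test strings are made up for illustration and are not part of the original DataFrame.

```python
import re

pattern = "HELLO + WORLD"  # ' +' is read as "one or more spaces"

# Text with a literal plus sign: no match, because the regex expects
# only spaces between "HELLO" and " WORLD".
literal_plus = re.search(pattern, "HELLO + WORLD") is not None

# Text with two spaces between the words: "HELLO", one space consumed
# by ' +', then " WORLD".
two_spaces = re.search(pattern, "HELLO  WORLD") is not None

print(literal_plus)  # False
print(two_spaces)    # True
```

In other words, the unescaped pattern only finds strings where the two words are separated by two or more spaces and nothing else.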

To get your expected result, either escape the + with a \ so the regex treats it as a literal plus character (use a raw string so Python passes the backslash through unchanged), or set regex=False to disable regex matching altogether:

df['AREA'].str.contains(r"HELLO \+ WORLD")
# or
df['AREA'].str.contains("HELLO + WORLD", regex=False)

Both will output:

0      False
1      False
2      False
3       True
4       True
       ...  
198     True
199     True
200     True
201     True
202     True
Name: AREA, Length: 203, dtype: bool
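
If the search string is not known in advance, escaping it by hand is not practical. One possible approach, sketched below, is to pass it through re.escape from the standard library before handing it to str.contains. The variable name search_term is hypothetical and stands in for any value supplied at runtime.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'AREA': ["HELLO / WORLD"] * 3 + ["HELLO + WORLD"] * 200
})

search_term = "HELLO + WORLD"  # hypothetical: e.g. a value supplied at runtime

# re.escape backslash-escapes regex metacharacters such as '+',
# so the result matches the original text literally.
escaped = re.escape(search_term)

mask_escaped = df['AREA'].str.contains(escaped)
mask_literal = df['AREA'].str.contains(search_term, regex=False)

print(mask_escaped.sum())                 # 200
print(mask_escaped.equals(mask_literal))  # True
```

When no regex features are needed at all, regex=False is the simpler choice; re.escape is mainly useful when the literal text has to be embedded in a larger pattern.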