I have a Pandas DataFrame with a column containing different strings. I'm trying to find all rows where a specific string, "HELLO + WORLD", appears. However, when I use str.contains(), it only returns True for the first few rows. Here's my sample code:
import pandas as pd
df = pd.DataFrame({
    'AREA': ["HELLO / WORLD"] * 3 + ["HELLO + WORLD"] * 200
})
print(df['AREA'].str.contains("HELLO + WORLD"))
print(df['AREA'].str.contains("HELLO / WORLD"))
OUTPUT:
0 False
1 False
2 False
3 False
4 False
...
198 False
199 False
200 False
201 False
202 False
Name: AREA, Length: 203, dtype: bool
0 True
1 True
2 True
3 False
4 False
...
198 False
199 False
200 False
201 False
202 False
Name: AREA, Length: 203, dtype: bool
I expected to get True for all rows containing the correct substrings, but the output is mostly False. Can someone explain why this is happening and suggest a solution?
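By default, pandas.Series.str.contains takes a regex pattern, not a literal string. In a regex, + is a quantifier meaning "one or more of the preceding element", so it never matches a literal plus sign. The pattern "HELLO + WORLD" therefore tries to match the string "HELLO" followed by one or more space characters (" +"), followed by " WORLD". None of your rows contain that, which is why the first call returns False everywhere.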
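To see that interpretation in isolation, here is a small sketch using Python's standard re module, whose pattern syntax str.contains follows by default. The sample strings other than the two from the question are made up for illustration.

```python
import re

# As a regex: "HELLO", then one or more spaces (" +"), then " WORLD"
pattern = "HELLO + WORLD"

# The last two samples are hypothetical; spaces are built explicitly for clarity
samples = [
    "HELLO + WORLD",
    "HELLO / WORLD",
    "HELLO" + " " * 2 + "WORLD",
    "HELLO" + " " * 3 + "WORLD",
]

# Map each sample to whether the pattern matches anywhere in it
results = {s: bool(re.search(pattern, s)) for s in samples}

for s, matched in results.items():
    print(f"{s!r}: {matched}")
```

Only the samples with two or more consecutive spaces between the words match; the string containing an actual plus sign does not.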
To get your expected result, either use a regex pattern that escapes the + with a \ so it's interpreted as a literal plus character, or set regex=False:
df['AREA'].str.contains(r"HELLO \+ WORLD")  # raw string keeps the backslash intact
# or
df['AREA'].str.contains("HELLO + WORLD", regex=False)
Both will output:
0 False
1 False
2 False
3 True
4 True
...
198 True
199 True
200 True
201 True
202 True
Name: AREA, Length: 203, dtype: bool
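If the search text comes from a variable and you would rather not escape it by hand, one option is re.escape from the standard library. The sketch below rebuilds the DataFrame from the question and compares that approach with regex=False; the names needle, escaped_mask and literal_mask are just illustrative.

```python
import re

import pandas as pd

# Same data as in the question
df = pd.DataFrame({
    "AREA": ["HELLO / WORLD"] * 3 + ["HELLO + WORLD"] * 200
})

needle = "HELLO + WORLD"  # literal text to search for (illustrative name)

# re.escape backslash-escapes regex metacharacters such as "+"
escaped_mask = df["AREA"].str.contains(re.escape(needle))

# regex=False treats the pattern as a plain substring
literal_mask = df["AREA"].str.contains(needle, regex=False)

print(escaped_mask.sum())                 # number of matching rows
print(escaped_mask.equals(literal_mask))  # whether both approaches agree
```

For a plain substring search, regex=False is the simpler choice; re.escape is mainly useful when the literal text has to be embedded in a larger regex.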