I have two dataframes: one dataframe df with one column containig text data and another dataframe econ_terms with two columns containing positive and negative economic terms.
I want to remove all text rows that do not contain any strings from 'positive' or 'negative' economic terms
# Convert the column to a string
df['text'] = df['text'].astype(str)
econ_terms['plus'] = econ_terms['plus'].astype(str)
econ_terms['minus'] = econ_terms['minus'].astype(str)
# Get the unique values from 'plus' and 'minus' columns in the 'econ_terms' DataFrame
econ_values = set(econ_terms['plus']).union(set(econ_terms['minus']))
# Filter the 'df' DataFrame using boolean indexing
df_filtered = df[df['text'].isin(econ_values)]
the column 'minus' contains words such as unemployment, which is clearly in the 'text' column when going through the data manually.
However the df_filtered shows an empty dataframe. What could be the reason for this ?
You are describing one thing but your code is doing an entirely different thing.
You want to search if 'text' contains the terms for 'plus' & 'minus' but your checking if the 'text' is contained within 'plus' & 'minus'. This example illustrates the problem:
import pandas as pd
d = {'text': ['equally good_value', 'bad_value', 'good_value']}
df = pd.DataFrame(data=d)
d = {'plus': ['good_value', 'more_values'], 'minus': ['good_value', 'some_other_value']}
econ_values = pd.DataFrame(data=d)
df['text'] = df['text'].astype(str)
econ_values['plus'] = econ_values['plus'].astype(str)
econ_values['minus'] = econ_values['minus'].astype(str)
econ_values = set(econ_values['plus']).union(set(econ_values['minus']))
df_filtered = df[df['text'].isin(econ_values)]
print(df_filtered)
Output:
text
2 good_value
Notice how 'equally good_value' disappeared because it was not in 'plus' & 'minus' even though it contains 'good_value'.
You need to search the other way around:
import pandas as pd
d = {'text': ['equally good_value', 'bad_value', 'good_value']}
df = pd.DataFrame(data=d)
d = {'plus': ['good_value', 'more_values'], 'minus': ['good_value', 'some_other_value']}
econ_values = pd.DataFrame(data=d)
df['text'] = df['text'].astype(str)
econ_values['plus'] = econ_values['plus'].astype(str)
econ_values['minus'] = econ_values['minus'].astype(str)
econ_values = set(econ_values['plus']).union(set(econ_values['minus']))
def search_text(row):
match_count = 0
for term in econ_values:
if term in row:
match_count +=1
if match_count > 0:
return True
else:
return False
df['contains'] = df['text'].apply(lambda row: search_text(row))
Output:
text contains
0 equally good_value True
1 bad_value False
2 good_value True
Then you can filter rows:
df_filtered = df.loc[df['contains']==True]
print(df_filtered)
Output:
text contains
0 equally good_value True
2 good_value True
The example above reverses the search. It checks if 'plus' & 'minus' are inside 'text' not the other way around.