pythonnlpisin

string matches are empty although string is contained in column


I have two dataframes: one dataframe df with one column containig text data and another dataframe econ_terms with two columns containing positive and negative economic terms.

I want to remove all text rows that do not contain any strings from 'positive' or 'negative' economic terms

# Convert the column to a string
df['text'] = df['text'].astype(str)
econ_terms['plus'] = econ_terms['plus'].astype(str)
econ_terms['minus'] = econ_terms['minus'].astype(str)

# Get the unique values from 'plus' and 'minus' columns in the 'econ_terms' DataFrame
econ_values = set(econ_terms['plus']).union(set(econ_terms['minus']))

# Filter the 'df' DataFrame using boolean indexing
df_filtered = df[df['text'].isin(econ_values)]

the column 'minus' contains words such as unemployment, which is clearly in the 'text' column when going through the data manually.

However the df_filtered shows an empty dataframe. What could be the reason for this ?


Solution

  • You are describing one thing but your code is doing an entirely different thing.

    You want to search if 'text' contains the terms for 'plus' & 'minus' but your checking if the 'text' is contained within 'plus' & 'minus'. This example illustrates the problem:

    import pandas as pd
    
    d = {'text': ['equally good_value', 'bad_value', 'good_value']}
    df = pd.DataFrame(data=d)
    
    d = {'plus': ['good_value', 'more_values'], 'minus': ['good_value', 'some_other_value']}
    econ_values = pd.DataFrame(data=d)
    
    df['text'] = df['text'].astype(str)
    econ_values['plus'] = econ_values['plus'].astype(str)
    econ_values['minus'] = econ_values['minus'].astype(str)
    
    econ_values = set(econ_values['plus']).union(set(econ_values['minus']))
    
    df_filtered = df[df['text'].isin(econ_values)]
    
    print(df_filtered)
    

    Output:

             text
    2  good_value
    

    Notice how 'equally good_value' disappeared because it was not in 'plus' & 'minus' even though it contains 'good_value'.

    You need to search the other way around:

    import pandas as pd
    
    d = {'text': ['equally good_value', 'bad_value', 'good_value']}
    df = pd.DataFrame(data=d)
    
    d = {'plus': ['good_value', 'more_values'], 'minus': ['good_value', 'some_other_value']}
    econ_values = pd.DataFrame(data=d)
    
    df['text'] = df['text'].astype(str)
    econ_values['plus'] = econ_values['plus'].astype(str)
    econ_values['minus'] = econ_values['minus'].astype(str)
    
    econ_values = set(econ_values['plus']).union(set(econ_values['minus']))
    
    def search_text(row):
        match_count = 0
        for term in econ_values:
            if term in row:
                match_count +=1
        if match_count > 0:
            return True
        else:
            return False
    
    df['contains'] = df['text'].apply(lambda row: search_text(row))
    

    Output:

                     text     contains
    0  equally good_value      True
    1           bad_value     False
    2          good_value      True
    

    Then you can filter rows:

    df_filtered = df.loc[df['contains']==True]
    print(df_filtered)
    

    Output:

                     text  contains
    0  equally good_value  True
    2          good_value  True
    

    The example above reverses the search. It checks if 'plus' & 'minus' are inside 'text' not the other way around.