python pandas dataframe

How do I replace values in a DataFrame with tuples (one-hot vectors)?


I am learning about machine learning (ML), and I decided to use it for spam and non-spam email classification.

The example data I am using consists of an email subject, importance, and sender, each stored as a string. I want to convert them into vectors like [1,0,0] so that I can distinguish each value.

The error I am encountering is that I cannot replace a value with a vector because the sizes do not match.

def vec(u_v):
    y = len(u_v)
    x = [0] * y
    for j in range(y):
        x[j] = 1
        u_v[j] = tuple(x.copy()) 
        x = [0] * y
    return u_v


def arrange(df):
    organized_df = df.copy()  
    for i in df.columns:
        unique_values = df[i].unique()
        replacement_values = vec(unique_values)
        for j in range(len(unique_values)):
            organized_df[i] = organized_df[i].replace({unique_values[j]: replacement_values[j]})

    return organized_df

These are the two functions I'm using to organize the DataFrame. This is the error I receive:

ValueError: operands could not be broadcast together with shapes (1000,) (6,) 

I was expecting something like this:

| Subject | Importance |
| -------- | -------- |
| [1,0,0]   | [0,0,1]   |
| [0,1,0]   | [1,0,0]   |

Solution

  • With pandas, you can achieve this using get_dummies:

    Each variable is converted in as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.

    import pandas as pd
    
    df = pd.DataFrame({
        'Subject': ['spam', 'not-spam', 'spam', 'not-spam'],
        'Importance': ['low', 'high', 'medium', 'low']
    })
    
    
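    # dtype=int forces 0/1 columns; pandas 2.0+ returns True/False by default
    organized_df = pd.get_dummies(df, dtype=int)
    print(organized_df)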
    

    Output:

       Subject_not-spam  Subject_spam  Importance_high  Importance_low  Importance_medium
    0                 0             1                0               1                  0
    1                 1             0                1               0                  0
    2                 0             1                0               0                  1
    3                 1             0                0               1                  0
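
If you really do want one vector per cell, as in the table from the question, here is a minimal sketch of one way to do it. The helper name `one_hot_cells` is made up for illustration, and it assumes that the order of first appearance in each column is an acceptable category order. It builds a value-to-tuple lookup per column and applies it with `Series.map`, which avoids `replace` (the call that raises the error in the question).

```python
import pandas as pd


def one_hot_cells(df):
    """Return a new DataFrame where each cell holds a one-hot tuple.

    Hypothetical helper: categories are ordered by first appearance
    in each column, which is an assumption, not a requirement.
    """
    out = pd.DataFrame(index=df.index)
    for col in df.columns:
        categories = list(df[col].unique())
        size = len(categories)
        # Build the lookup once per column: value -> one-hot tuple
        lookup = {
            value: tuple(1 if k == pos else 0 for k in range(size))
            for pos, value in enumerate(categories)
        }
        # Map element by element so each tuple is stored as a single object
        out[col] = df[col].map(lambda v: lookup[v])
    return out


df = pd.DataFrame({
    'Subject': ['spam', 'not-spam', 'spam', 'not-spam'],
    'Importance': ['low', 'high', 'medium', 'low']
})

cells = one_hot_cells(df)
print(cells)
# cells.loc[0, 'Subject'] is (1, 0) because 'spam' appears first
```

Most ML libraries expect a flat numeric matrix rather than tuples inside cells, so the `get_dummies` result is usually the more convenient input for a classifier.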