[SOLVED] How do I replace values in a DataFrame with tuples?

I am learning about machine learning (ML), and I decided to use it for spam and non-spam email classification.

The example data I am using consists of an email subject, importance, and sender, each stored as a string. I want to convert each value into a vector like [1,0,0] so that I can tell the values apart.

The problem is that I cannot replace a value with its vector, because the sizes do not match.

def vec(u_v):
    y = len(u_v)
    x = [0] * y
    for j in range(y):
        x[j] = 1
        u_v[j] = tuple(x.copy()) 
        x = [0] * y
    return u_v


def arrange(df):
    organized_df = df.copy()  
    for i in df.columns:
        unique_values = df[i].unique()
        replacement_values = vec(unique_values)
        for j in range(len(unique_values)):
            organized_df[i] = organized_df[i].replace({unique_values[j]: replacement_values[j]})

    return organized_df

These are the two functions I'm using to organize the dataframe. This is the error I receive:

ValueError: operands could not be broadcast together with shapes (1000,) (6,)

I was expecting something like this:

| Subject | Importance |
| -------- | -------- |
| [1,0,0]   | [0,0,1]   |
| [0,1,0]   | [1,0,0]   |

Solution

With pandas, you can achieve this using get_dummies, which one-hot encodes the columns.

Each variable is converted into as many 0/1 variables as there are distinct values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value:

import pandas as pd

df = pd.DataFrame({
    'Subject': ['spam', 'not-spam', 'spam', 'not-spam'],
    'Importance': ['low', 'high', 'medium', 'low']
})


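# dtype=int keeps the output as 0/1; pandas 2.0+ returns True/False by default
organized_df = pd.get_dummies(df, dtype=int)
print(organized_df)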

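Output:

| Subject_not-spam | Subject_spam | Importance_high | Importance_low | Importance_medium |
| -------- | -------- | -------- | -------- | -------- |
| 0 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 | 0 |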
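
If you need one tuple per cell, as in the expected table in the question, rather than separate 0/1 columns, something like the following may work. This is only a minimal sketch: one_hot_tuples is a hypothetical helper name, and putting the 1 at the position where a value first appears in the column is an assumption, not something the question specifies.

```python
import pandas as pd


def one_hot_tuples(series):
    """Return a Series where each value is replaced by a one-hot tuple.

    Hypothetical helper: the position of the 1 follows the order in which
    each distinct value first appears in the column (an assumption).
    """
    uniques = list(series.unique())
    n = len(uniques)
    # Build the lookup once, e.g. {'low': (1, 0, 0), 'high': (0, 1, 0), ...}
    mapping = {
        value: tuple(1 if k == i else 0 for k in range(n))
        for i, value in enumerate(uniques)
    }
    # Build the new column from a plain list so each cell holds a whole tuple
    return pd.Series(
        [mapping[v] for v in series],
        index=series.index,
        name=series.name,
    )


df = pd.DataFrame({
    'Subject': ['spam', 'not-spam', 'spam', 'not-spam'],
    'Importance': ['low', 'high', 'medium', 'low'],
})

# Encode every column independently and keep the original row index
tuple_df = pd.DataFrame({col: one_hot_tuples(df[col]) for col in df.columns})
print(tuple_df)
```

Building the lookup dictionary first, and never writing into the array returned by unique(), keeps the original values and their replacements separate. Most ML libraries expect numeric columns rather than tuples in cells, so the get_dummies result is usually the easier input for a classifier.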