pandasdataframemachine-learningdata-cleaning

Separate a ingredients/feature into separate columns that is marked with "0" or "1"


I'm looking at a some food waste data where I have a fair bit of data including the Ingredients for what was in the food. I'm trying to do some ML on the data, and I'm having some trouble getting it ready.

example

I am able to separate all of individual ingredients into there own column, but I am struggling marking a 1 when that ingredient is in the list/column of Ingredients.

example2

I have been trying something along the lines of going row by row and seeing if the column name is in the Ingredients column then changing that to a 1. I haven't got very close at all, but this is the current disaster I have been trying. Also have experimented with get_dummies and a few others.

Is there an easier way of doing this?

def xs_os(df, Ingredients_Column):
    df2 = df.drop("Ingredients", axis = 0) 
    for z in df:
        for x in list(df2.columns.values):
            if x in str(Ingredients_Column):
                df.at[z, df[x]] = 1
xs_os(df, df['Ingredients'])
df.head()

Solution

  • A solution using scikit-learn MultiLabelBinarizer, which is precisely designed for what you want to do:

    from sklearn.preprocessing import MultiLabelBinarizer
    import pandas as pd
    
    df = pd.DataFrame(
        {
            "food": [
                "pepper steak", 
                "baked salmon"
            ],
            "ingredients":[
                "black pepper,steak", 
                "salmon,salt,black pepper"
            ]
        }
    )
    
    mlb = MultiLabelBinarizer()
    ingredients_lists = df["ingredients"].str.split(",")
    ingredients_dummies_df = pd.DataFrame(
        mlb.fit_transform(ingredients_lists),
        columns = mlb.classes_,
        index=df.index
    )
    
    df = pd.concat([df, ingredients_dummies_df], axis= 1)
    

    Beware that in your precise case, you may have to remove blanks before ingredient names to avoid creating a column " pepper" (with a preceding white space) in addition to the column "pepper". For instance,

    ingredients_lists = df["ingredients"].str.replace(", ", ",").str.split(",")