I'm looking at a some food waste data where I have a fair bit of data including the Ingredients for what was in the food. I'm trying to do some ML on the data, and I'm having some trouble getting it ready.
I am able to separate all of individual ingredients into there own column, but I am struggling marking a 1 when that ingredient is in the list/column of Ingredients.
I have been trying something along the lines of going row by row and seeing if the column name is in the Ingredients column then changing that to a 1. I haven't got very close at all, but this is the current disaster I have been trying. Also have experimented with get_dummies and a few others.
Is there an easier way of doing this?
def xs_os(df, Ingredients_Column):
df2 = df.drop("Ingredients", axis = 0)
for z in df:
for x in list(df2.columns.values):
if x in str(Ingredients_Column):
df.at[z, df[x]] = 1
xs_os(df, df['Ingredients'])
df.head()
A solution using scikit-learn MultiLabelBinarizer
, which is precisely designed for what you want to do:
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd
df = pd.DataFrame(
{
"food": [
"pepper steak",
"baked salmon"
],
"ingredients":[
"black pepper,steak",
"salmon,salt,black pepper"
]
}
)
mlb = MultiLabelBinarizer()
ingredients_lists = df["ingredients"].str.split(",")
ingredients_dummies_df = pd.DataFrame(
mlb.fit_transform(ingredients_lists),
columns = mlb.classes_,
index=df.index
)
df = pd.concat([df, ingredients_dummies_df], axis= 1)
Beware that in your precise case, you may have to remove blanks before ingredient names to avoid creating a column " pepper" (with a preceding white space) in addition to the column "pepper". For instance,
ingredients_lists = df["ingredients"].str.replace(", ", ",").str.split(",")