Tags: python, pandas, scikit-learn, classification, multilabel-classification



I have a dataset of people's characteristics, stored in a DataFrame df, and I need to predict each person's breakfast (the 'breakfast' column).

I am training a CatBoost classifier for that.

Is it possible in my case to predict not only one kind of breakfast, but also an additional one?

By additional I mean the second most appealing type of breakfast for a person.

I started with this:

    from catboost import CatBoostClassifier
    from sklearn.model_selection import train_test_split

    # Split off a test set, then carve a validation set out of the remaining data
    df_train, df_test = train_test_split(df, test_size=0.15, random_state=42)
    df_train, df_valid = train_test_split(df_train, test_size=0.15, random_state=42)

    # Separate the features from the 'breakfast' target in each split
    features_train = df_train.drop(['breakfast'], axis=1)
    target_train = df_train['breakfast']
    features_valid = df_valid.drop(['breakfast'], axis=1)
    target_valid = df_valid['breakfast']
    features_test = df_test.drop(['breakfast'], axis=1)
    target_test = df_test['breakfast']

    model_cat = CatBoostClassifier(random_state=42)
    model_cat.fit(features_train, target_train)

    # predict returns only the single most likely class for each row
    valid_predictions_tree = model_cat.predict(features_valid)

But this trains the model to output a single category, whereas I need the two best results, not one.

Solution

  • Using predict_proba instead will return the probability for every class of your target:

    valid_predictions_tree = model_cat.predict_proba(features_valid)
    

    To get labeled probabilities for an input dt, you can wrap the result in a DataFrame whose columns are the class names:

    proba = pd.DataFrame(model_cat.predict_proba(dt), columns=model_cat.classes_)
    

    Output example:

    Class1   Class2   Class3
    0.2      0.5      0.3
    0.7      0.2      0.1
    

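    Each row sums to 1 (100%). The class with the highest probability is the model's first choice, and the class with the second-highest probability is the additional one you are asking for.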
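
The probabilities can then be ranked per row to pull out the first and second choice. Below is a minimal sketch of that step, not part of CatBoost itself: `top_k_classes` is a hypothetical helper name, and the hand-written probabilities stand in for the output of `model_cat.predict_proba(...)`, with the class names standing in for `model_cat.classes_`.

```python
def top_k_classes(proba_rows, classes, k=2):
    """Return, for each row of probabilities, the k most probable class labels."""
    result = []
    for row in proba_rows:
        # Pair each probability with its class label, highest probability first.
        ranked = sorted(zip(row, classes), key=lambda pair: pair[0], reverse=True)
        result.append([label for _, label in ranked[:k]])
    return result


# Hand-written values mirroring the output example (an assumption, not real model output).
classes = ["Class1", "Class2", "Class3"]
proba_rows = [
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
]

top2 = top_k_classes(proba_rows, classes, k=2)
print(top2)  # [['Class2', 'Class3'], ['Class1', 'Class2']]
```

With the trained model, the same helper should accept the real values directly, for example `top_k_classes(model_cat.predict_proba(features_valid), model_cat.classes_)`, since it only needs to iterate over rows of probabilities. The first label in each inner list is the main prediction and the second is the runner-up.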