python, machine-learning, xgboost, data-augmentation

How to train XGBoost with class probabilities instead of class labels?


I am trying to train an XGBoost classifier by passing in the training dataset and the training labels. The labels are one-hot encoded, and instead of passing hard classes such as [0, 1, 0, 0], I want to pass class probabilities such as [0, 0.6, 0.4, 0] for each training datapoint. The reason is that I want to implement the mixup data augmentation algorithm, which blends pairs of samples and therefore produces floating-point soft labels for the augmented data (a sketch of what I mean is below).

However, model.fit raises an error because it expects hard class labels, not a probability for each class. How can I implement this data augmentation scheme with XGBoost?
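For reference, here is a minimal sketch of the mixup step I mean (the helper, the Beta parameter alpha, and the fixed seed are just for illustration):

import numpy as np

def mixup(X, y_one_hot, alpha=0.2, seed=0):
    """Blend random pairs of samples; labels become class probabilities."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha, size=(len(X), 1))  # per-pair mixing coefficients
    perm = rng.permutation(len(X))                  # random partner for each sample
    X_mix = lam * X + (1 - lam) * X[perm]
    y_mix = lam * y_one_hot + (1 - lam) * y_one_hot[perm]
    return X_mix, y_mix                             # rows of y_mix still sum to 1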

import xgboost as xgb
import numpy as np

# Generate some random data
X = np.random.rand(100, 16)

# Generate random class indices and one-hot encode them
y_classes = np.random.randint(0, 4, size=(100,))
y = np.eye(4)[y_classes]

# Stand-in for mixup output: rows of class probabilities. With pure
# one-hot rows this normalization is a no-op; real mixup output would
# contain mixed rows such as [0, 0.6, 0.4, 0]
y_proba = y / y.sum(axis=1, keepdims=True)

# Define the XGBoost model
model = xgb.XGBClassifier(objective='multi:softprob', num_class=4)

# Train the model. This call raises an error: fit() expects 1-D class
# labels, not an (n_samples, n_classes) probability matrix
model.fit(X, y_proba)

# Generate some test data
X_test = np.random.rand(10, 16)

# Predict the probabilities for each class
y_pred_proba = model.predict_proba(X_test)

# Get the predicted class for each sample
y_pred = np.argmax(y_pred_proba, axis=1)

Solution

  • Idea

    You can use the sample_weight parameter to circumvent the restriction that labels must be hard classes.

    Example

    Say your training data are these instances x_i with labels given as probabilities:

    x_1  [0, 1, 0, 0]
    x_2  [0, 0, 1, 0]
    x_3  [0, .6, .4, 0]
    x_4  [.7, 0, 0, .3]
    

    Transform them to these instances, with labels given directly:

    x_1  1
    x_2  2
    x_3  1
    x_3  2
    x_4  0
    x_4  3
    

    Then, when you plug these transformed data into the fit method, pass the argument sample_weight=[1, 1, .6, .4, .7, .3].
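
    A runnable version of this toy example (the feature values are random stand-ins for x_1..x_4; the labels and weights come from the tables above):

    import numpy as np
    import xgboost as xgb

    X_small = np.random.rand(4, 16)        # stand-in features for x_1..x_4
    X_dup = X_small[[0, 1, 2, 2, 3, 3]]    # x_3 and x_4 each appear twice
    y_hard = np.array([1, 2, 1, 2, 0, 3])  # direct class labels from the table
    w = np.array([1, 1, .6, .4, .7, .3])   # probability mass as sample weights

    clf = xgb.XGBClassifier(objective='multi:softprob')
    clf.fit(X_dup, y_hard, sample_weight=w)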

    General Implementation

    Given your model, X, and y_proba:

    n_samples, n_classes = y_proba.shape
    X_upsampled = X.repeat(n_classes, axis=0)            # each row repeated once per class
    y_direct = np.tile(np.arange(n_classes), n_samples)  # labels 0..n_classes-1 per sample
    sample_weights = y_proba.ravel()                     # each label weighted by its probability

    model.fit(X_upsampled, y_direct, sample_weight=sample_weights)
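
    One possible refinement (my suggestion, not part of the scheme above): rows with zero weight contribute nothing to the loss, so you can drop them, as long as every class keeps at least one positive-weight row:

    # Drop zero-weight duplicates to keep the upsampled training set small
    mask = sample_weights > 0
    model.fit(X_upsampled[mask], y_direct[mask], sample_weight=sample_weights[mask])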