I am trying to train an XGBoost classifier by passing in the training dataset and the training labels. The labels are one-hot encoded, but instead of passing hard class vectors such as [0, 1, 0, 0], I want to pass per-class probabilities such as [0, 0.6, 0.4, 0] for each training point. The reason is that I want to implement the mixup data-augmentation algorithm, which produces soft (floating-point) label vectors for the augmented samples.
However, model.fit raises an error because it expects hard one-hot labels, not per-class probabilities. How can I implement this data-augmentation algorithm with XGBoost?
import xgboost as xgb
import numpy as np
# Generate some random data
X = np.random.rand(100, 16)
# Generate random class indices and one-hot encode them
y_int = np.random.randint(0, 4, size=(100,))
y = np.eye(4)[y_int]
# Convert one-hot encoded target variable to probabilities
y_proba = np.zeros((y.shape[0], y.shape[1]))
for i, row in enumerate(y):
    y_proba[i] = row / np.sum(row)
# Define the XGBoost model
model = xgb.XGBClassifier(objective='multi:softprob', num_class=4)
# Train the model
model.fit(X, y_proba)
# Generate some test data
X_test = np.random.rand(10, 16)
# Predict the probabilities for each class
y_pred_proba = model.predict_proba(X_test)
# Get the predicted class for each sample
y_pred = np.argmax(y_pred_proba, axis=1)
You can use the sample_weight parameter to circumvent the label encoding restriction.
Say your training data are these instances x_i with labels given as probabilities:
x_1 [0, 1, 0, 0]
x_2 [0, 0, 1, 0]
x_3 [0, .6, .4, 0]
x_4 [.7, 0, 0, .3]
Transform them to these instances, with labels given directly:
x_1 1
x_2 2
x_3 1
x_3 2
x_4 0
x_4 3
Then, when you plug these transformed data into the fit method, pass the argument sample_weight=[1, 1, .6, .4, .7, .3].
Given your model, X, and y_proba:
n_samples, n_classes = y_proba.shape
# Repeat each instance once per class
X_upsampled = X.repeat(n_classes, axis=0)
# Label the k-th copy of every instance with class k: 0, 1, ..., n_classes-1, 0, 1, ...
y_direct = np.tile(range(n_classes), n_samples)
# Weight each copy by the probability of its class
sample_weights = y_proba.ravel()
model.fit(X_upsampled, y_direct, sample_weight=sample_weights)