I am dealing with a classification problem with 3 classes [0,1,2], and imbalanced class distribution as shown below.
I want to apply XGBClassifier (in Python) to this classification problem, but the model does not respond to class_weight
adjustments and skews towards the majority class 0, and ignores the minority classes 1,2. Which hyperparameters other than class_weight
can help me?
I tried 1) computing class weights using sklearn's compute_class_weight; 2) setting weights according to the relative frequency of the classes; and 3) manually assigning extreme weights, such as {0:0.5,1:100,2:200}, to see if anything changes at all. In every case, the classifier still fails to take the minority classes into account.
Observations:
I can handle the problem in the binary case: if I turn the problem into a binary classification by merging classes [1,2], then I can get the classifier to work properly by adjusting scale_pos_weight (even in this case class_weight alone does not help).
But scale_pos_weight, as far as I know, works only for binary classification. Is there an analogue of this parameter for multiclass problems?
Using RandomForestClassifier instead of XGBClassifier, I can handle the problem by setting class_weight='balanced_subsample' and tuning max_leaf_nodes. But, for some reason, this approach does not work for XGBClassifier.
Remark: I know about balancing techniques such as over/undersampling and SMOTE, but I want to avoid them as much as possible and would prefer a solution based on hyperparameter tuning of the model. My observation above shows that this can work for the binary case.
The sample_weight parameter of fit() is useful for handling imbalanced data when training with XGBoost. You can compute sample weights using compute_sample_weight() from the sklearn library.
This code should work for multiclass data:
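from sklearn.utils.class_weight import compute_sample_weight

# One weight per training row, inversely proportional to its class frequency
sample_weights = compute_sample_weight(
    class_weight='balanced',
    y=train_df['class']  # provide your own target name
)
# X and y are the training features and target; y must be the same target used above
xgb_classifier.fit(X, y, sample_weight=sample_weights)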
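To see what the weights actually look like, here is a minimal sketch on a made-up target with a 90/7/3 split (the toy array y and the variable names are assumptions for illustration, not data from the question). It also shows that compute_sample_weight accepts an explicit dict, which is the closest thing to a multiclass version of hand-tuning scale_pos_weight:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical imbalanced target with three classes (90 / 7 / 3 rows).
y = np.array([0] * 90 + [1] * 7 + [2] * 3)

# 'balanced' assigns each row n_samples / (n_classes * count_of_its_class).
balanced = compute_sample_weight(class_weight="balanced", y=y)

# Summed per class, the balanced weights come out equal,
# so every class contributes the same total weight to the loss.
per_class_total = {int(c): float(balanced[y == c].sum()) for c in np.unique(y)}
print(per_class_total)

# A manual mapping is also accepted, mirroring the dict tried in the question.
manual = compute_sample_weight(class_weight={0: 0.5, 1: 100, 2: 200}, y=y)
print(manual[0], manual[90], manual[97])
```

Either array can then be passed as sample_weight to xgb_classifier.fit, exactly as in the snippet above. If the balanced weights are not aggressive enough, adjusting the dict values by hand is one way to push the model further towards the minority classes.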