python, machine-learning, scikit-learn, random-forest

How to improve performance of random forest multiclass classification model?


I am working on a multiclass classification problem: segmenting customers into 3 classes based on their purchasing behavior and demographics. I cannot disclose the dataset completely, but in general it contains around 300 features and 50,000 rows. I have tried the following methods, but I am unable to achieve accuracy above 50%:

  1. Tuning the hyperparameters (I am using the tuned hyperparameters found by GridSearchCV)
  2. Normalizing the dataset and then running my models
  3. Trying different classification methods: OneVsRestClassifier, RandomForestClassifier, SVM, KNN, and LDA
  4. Removing irrelevant features and re-running my models
  5. Since my classes were imbalanced, I also tried class_weight="balanced", oversampling with SMOTE, downsampling, and resampling

Is there anything else I can try to improve performance (F1 score, precision, and recall)?
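For reference, the class_weight="balanced" attempt from step 5 could look like the sketch below. The data here is a toy stand-in generated with make_classification (the real dataset cannot be shared), so feature counts and class weights are assumptions, not the actual values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 3 imbalanced classes, 20 features.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights each class inversely to its frequency,
# an alternative (or complement) to oversampling methods such as SMOTE.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
clf.fit(X_tr, y_tr)
print(f1_score(y_te, clf.predict(X_te), average="macro"))
```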


Solution

  • Try tuning the parameters below:

    n_estimators

    This is the number of trees built before taking the majority vote or averaging the predictions. A higher number of trees generally gives better performance but makes training slower. Choose as high a value as your processor can handle, since more trees make the predictions stronger and more stable. Because your dataset is large, each fit will take longer, but it is worth trying.
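    A quick way to see where the performance plateau sets in is to sweep n_estimators with cross-validation. This is a minimal sketch on synthetic data (the sample sizes and candidate values are illustrative assumptions):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    # Synthetic 3-class data standing in for the real customer dataset.
    X, y = make_classification(n_samples=1000, n_features=20, n_classes=3,
                               n_informative=8, random_state=0)

    # Accuracy typically plateaus as trees are added; pick the knee of the curve.
    for n in (50, 100, 300):
        score = cross_val_score(
            RandomForestClassifier(n_estimators=n, random_state=0),
            X, y, cv=3).mean()
        print(n, round(score, 3))
    ```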

    max_features

    This is the maximum number of features Random Forest is allowed to try at each split of an individual tree. scikit-learn offers several ways to set it: "sqrt", "log2", None (consider all features), an integer count, or a float fraction of the features.
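    The accepted settings can be sketched like this; the data is synthetic and the specific values tried are illustrative:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=30, n_classes=3,
                               n_informative=6, random_state=0)

    # Each setting controls how many features a tree may consider per split:
    # "sqrt"/"log2" of the total, None = all 30, 10 = exact count, 0.5 = fraction.
    for mf in ("sqrt", "log2", None, 10, 0.5):
        clf = RandomForestClassifier(n_estimators=50, max_features=mf,
                                     random_state=0)
        clf.fit(X, y)
        print(mf, round(clf.score(X, y), 3))
    ```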

    min_samples_leaf

    A leaf is the end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in the training data. You can start with a value such as 75 and gradually increase it, keeping the value at which your accuracy is highest.
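    That sweep can be sketched as follows; the dataset is synthetic and the candidate leaf sizes (including the suggested start of 75) are illustrative:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=1500, n_features=20, n_classes=3,
                               n_informative=8, random_state=0)

    # Larger leaves smooth the trees and reduce overfitting to noise;
    # keep the value where cross-validated accuracy peaks.
    for leaf in (1, 25, 75, 150):
        score = cross_val_score(
            RandomForestClassifier(n_estimators=100, min_samples_leaf=leaf,
                                   random_state=0),
            X, y, cv=3).mean()
        print(leaf, round(score, 3))
    ```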