I am working on a multi-class classification problem: segmenting customers into 3 classes based on their purchasing behavior and demographics. I cannot disclose the data set in full, but broadly it contains around 300 features and 50,000 rows. I have tried the following methods but am unable to achieve accuracy above 50%:
Is there anything else I can try to improve performance (F-score, precision, and recall)?
Try tuning the following Random Forest parameters:
n_estimators: This is the number of trees to build before taking the majority vote or average of the predictions. A higher number of trees generally gives better performance but makes training slower. Choose as high a value as your hardware can handle, since more trees make the predictions stronger and more stable. With a data set of your size each fit will take longer, but it is worth trying.
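A minimal sketch of this, assuming scikit-learn (the parameter names described here match its RandomForestClassifier). The synthetic data below is a small stand-in for the real 300-feature, 50,000-row data set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the real data (3 classes, as in the question).
X, y = make_classification(n_samples=3000, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for n in (50, 200, 500):  # more trees: slower training, more stable predictions
    clf = RandomForestClassifier(n_estimators=n, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)
    scores[n] = f1_score(y_test, clf.predict(X_test), average="macro")
    print(n, round(scores[n], 3))
```

Scoring with macro F1 here (rather than accuracy) matches the metrics you say you care about.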
max_features: This is the maximum number of features Random Forest is allowed to try at each split of an individual tree. Python (scikit-learn) offers several ways to set it. A few of them are:
None : This simply takes all the features at every split; no restriction is placed on the individual trees. (Note that in scikit-learn, "auto" is not the same as None: for classifiers it has historically meant sqrt.)
sqrt : This option takes the square root of the total number of features at each split. For instance, if there are 100 variables in total, only 10 of them are considered at each split. "log2" is a similar option for max_features.
0.2 : This option allows the random forest to consider 20% of the variables at each split. Any fraction between 0 and 1 can be given to use that proportion of the features.
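The options above map directly onto scikit-learn's max_features argument; a quick sketch (the variable names are just illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# Each of these is a valid max_features setting in scikit-learn:
all_feats = RandomForestClassifier(max_features=None)     # no restriction
sqrt_feats = RandomForestClassifier(max_features="sqrt")  # sqrt(n_features) per split
log2_feats = RandomForestClassifier(max_features="log2")  # log2(n_features) per split
frac_feats = RandomForestClassifier(max_features=0.2)     # 20% of features per split
```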
min_samples_leaf: A leaf is an end node of a decision tree. A smaller leaf makes the model more prone to capturing noise in the training data. You can start with a minimum value like 75 and gradually increase it, checking which value gives the best accuracy.
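Rather than tuning each parameter by hand, the three parameters above can be searched jointly with cross-validation. A sketch, again on small synthetic stand-in data, scored on macro F1 since the question asks about F-score rather than plain accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class stand-in for the real data set.
X, y = make_classification(n_samples=2000, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)

param_grid = {
    "n_estimators": [100, 300],
    "max_features": ["sqrt", 0.2],
    "min_samples_leaf": [25, 75, 150],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    param_grid,
    scoring="f1_macro",  # optimise F-score, not accuracy
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

On the real 50,000-row data this grid will be slow; RandomizedSearchCV with the same grid is a cheaper alternative.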