machine-learning, random-forest, feature-selection

Random Forest Classifier Removing Features using Top-N Features Method


I am trying to predict the winner of an NBA game using a random forest classifier. I have been removing and modifying features in my feature list to increase accuracy and reduce noise.

I implemented the solution found here: https://datascience.stackexchange.com/questions/57697/decision-trees-should-we-discard-low-importance-features, looping through the top N most important features and plotting the resulting accuracy for each N. After all my features have gone through that loop, I'm left with a plot that looks like this: [plot of accuracy versus N; image not included]

As you can see, the resulting graph is all over the place. Should I remove the features where the slope is negative? If not, what is the threshold for removing features? Is there a better way to measure noise? How would I get the most accurate model, given that I have so many features with such a variable impact on my model's accuracy on training data?
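For illustration, the top-N loop described above can be sketched as follows. This is a minimal sketch, not the code from the linked answer: it assumes scikit-learn, uses synthetic data in place of the NBA features, and scores each N with 5-fold cross-validation rather than training accuracy. The ranking is computed once on all rows, which is a simplification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the NBA feature table (assumption: 12 features, 4 informative).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, random_state=0
)

# Rank features by impurity-based importance from a forest fit on all features.
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]  # most important first

# For each N, fit a fresh forest on the top-N features and record mean CV accuracy.
scores = []
for n in range(1, X.shape[1] + 1):
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    cv_acc = cross_val_score(model, X[:, ranking[:n]], y, cv=5, scoring="accuracy")
    scores.append(cv_acc.mean())

best_n = int(np.argmax(scores)) + 1
print(f"best N = {best_n}, mean CV accuracy = {scores[best_n - 1]:.3f}")
```

Averaging over folds tends to smooth the curve compared with a single train/test split, so small dips between neighbouring values of N are easier to tell apart from noise.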


Solution

  • As a starting point, you could try some feature selection techniques that are easier to interpret. This is what I would try, based on the small subset of techniques that I am familiar with and comfortable using:

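    If you have continuous variables, plot a correlation matrix and drop one feature from each highly correlated pair to reduce multicollinearity. With a categorical target such as win/loss, an ANOVA F-test can rank continuous features; for categorical features, a chi-squared test is the usual counterpart. If you have a large number of features and a small sample size, you could investigate dimensionality reduction techniques like PCA. Note that PCA is linear, so if the relationships between features are nonlinear, a nonlinear variant such as kernel PCA may be a better fit.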
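    A minimal sketch of the first two suggestions, assuming pandas and scikit-learn: a correlation filter followed by an ANOVA F-test (scikit-learn's `f_classif`, which scores continuous features against a categorical target). The data is synthetic, and the column names, the 0.9 correlation threshold, and `k=4` are illustrative assumptions rather than recommended values.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the real features (assumption: 8 continuous columns).
X_arr, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Add a near-duplicate column so the correlation filter has something to remove.
rng = np.random.default_rng(0)
X["f0_copy"] = X["f0"] + rng.normal(scale=0.01, size=len(X))

# Step 1: correlation filter. Look only at the upper triangle so that each
# pair is considered once, then drop the later column of any pair above 0.9.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)

# Step 2: ANOVA F-test. Keep the k features whose means differ most by class.
selector = SelectKBest(score_func=f_classif, k=4).fit(X_reduced, y)
kept = X_reduced.columns[selector.get_support()].tolist()

print("dropped as highly correlated:", to_drop)
print("kept by ANOVA F-test:", kept)
```

    The F-test scores each feature on its own, so it can miss features that only matter in combination; it is a quick first pass, not a replacement for checking the model's cross-validated accuracy.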