machine-learning, random-forest, sample-size

Minimum number of observations when performing Random Forest


Is it possible to apply random forests to very small datasets? I have a dataset with many variables but only 25 observations. Random forests produce reasonable results, with low OOB errors (10-25%). Is there any rule of thumb regarding the minimum number of observations to use? In addition, one of the response variables is unbalanced, and if I subsample it I will end up with an even smaller number of observations. Thanks in advance.


Solution

  • Absolutely, random forests can be used on this type of dataset (i.e. p > n). In fact, RF is used in fields like genomics, where the number of variables is >= 20,000 and there are only a very small number of rows, say 10-12. There, the entire problem is figuring out which of the 20k variables would make up a parsimonious marker (i.e. feature selection is the whole problem).

    I don't have any rules of thumb about minimum sample size, other than this: if your model doesn't perform well on a held-back sample (leave-one-out cross-validation might work well in your case), then you should try something else. A rough sketch of both checks follows at the end of this answer.

    Hope this helps
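
    Below is a minimal sketch (not part of the original answer) of the workflow described above, using scikit-learn: fit a random forest on a small, wide synthetic dataset, read off the OOB error, estimate error with leave-one-out cross-validation, and look at variable importances as a starting point for feature selection. The dataset sizes and hyperparameters are purely illustrative.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    # Tiny, wide dataset: 25 observations, 2000 candidate variables (p >> n).
    X, y = make_classification(n_samples=25, n_features=2000,
                               n_informative=10, random_state=0)

    # The OOB error comes "for free" from the bootstrap samples used to grow each tree.
    rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
    rf.fit(X, y)
    print("OOB error:", 1 - rf.oob_score_)

    # Leave-one-out cross-validation: refit on n-1 observations, test on the one
    # held back, and repeat for every observation.
    loo_scores = cross_val_score(
        RandomForestClassifier(n_estimators=500, random_state=0),
        X, y, cv=LeaveOneOut())
    print("LOO error:", 1 - loo_scores.mean())

    # Variable importances give a starting point for the feature-selection problem
    # mentioned above (finding a parsimonious subset of the many variables).
    top10 = np.argsort(rf.feature_importances_)[::-1][:10]
    print("Top 10 variables by importance:", top10)

    With only 25 observations both estimates will be noisy, so treat them as a sanity check rather than a precise error rate.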