Tags: r, random-forest, feature-selection, party

Why do the varimp functions in the R packages party and partykit return different numbers of variables, and how can partykit::varimp be made to return all variables?


The varimp functions in the R packages party and partykit return variable-importance vectors of different lengths, i.e. different numbers of variables: party returns one importance value per variable supplied to the model, as expected, while partykit returns fewer. The partykit behaviour appears to truncate the variable list, but the documentation doesn't indicate why or how this is done, or (ideally) how to counteract it by forcing every variable to be returned with an importance value. (To be clear, the question is not about differences in the ordering of the importances, only about the number of variables returned.)

MRE:

library(party)
library(partykit)

set.seed(303)

# generate some pseudo-data
# make empty data frame
df <- data.frame(matrix(ncol = 61, nrow = 100))
iVarNames <- paste("iVar", 1:60, sep="")
colnames(df) <- c("dVar", iVarNames)

# generate dependent variable
df['dVar'] <- runif(100, min=0, max=10)

# generate independent variables
for (icol in iVarNames){
  df[[icol]] <- df$dVar + runif(1, 0.5, 1) * rnorm(100, sd = runif(1, 1, 5))
}

# make model formula
formVar <- reformulate(iVarNames, 'dVar')

# train CRF model in party
crfParty <- party::cforest(formVar, data=df,
                           controls=party::cforest_unbiased(ntree=300,
                                                            mtry=as.integer(length(iVarNames)/3)))
# get variable importance from party
impParty <- sort(party::varimp(crfParty), decreasing=TRUE)

# train CRF model in partykit
crfPartykit <- partykit::cforest(formVar, data=df,
                                 ntree=300, mtry=as.integer(length(iVarNames)/3))
# get variable importance from partykit
impPartykit <- sort(partykit::varimp(crfPartykit), decreasing=TRUE)

# compare variable importance outputs
length(impParty)
length(impPartykit)

Returns:

[1] 60
[1] 43
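
To see exactly which covariates are dropped, their names can be listed with base R:

setdiff(iVarNames, names(impPartykit))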

For context, I'm comparing these packages because I've found partykit (using parallel processing) to be faster than party at calculating conditional variable importance, although I've left those aspects out of this example for the sake of an MRE.
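
Roughly, the conditional/parallel call I'm benchmarking looks like the sketch below (as far as I can tell, conditional and cores are arguments of partykit::varimp for cforest objects; treat this as a sketch rather than my exact code):

# conditional permutation importance, computed in parallel
impCond <- partykit::varimp(crfPartykit, conditional = TRUE, cores = 4)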


Solution

  • TL;DR In partykit, only the variables that are actually used for splitting somewhere in the forest receive a variable importance; variables that were never selected for any split are omitted from the output entirely.
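
    To directly answer the "how can all variables be returned" part: varimp itself has no option for this (as far as I know), but you can pad the returned vector so that every covariate appears, filling the never-used ones with 0 (or NA, if you prefer to mark them as "not assessed"). With the objects from your MRE:

    imp <- partykit::varimp(crfPartykit)
    impAll <- setNames(numeric(length(iVarNames)), iVarNames)  # start with all zeros
    impAll[names(imp)] <- imp                                  # fill in computed importances
    length(impAll)  # 60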

    Illustration: To bring out the issue with a simpler example, I'm using the well-known iris data for Species classification, with the four available covariates Petal.Length, Petal.Width, Sepal.Length, and Sepal.Width. The former two discriminate between the species much more easily than the latter two. Thus, when all four variables are offered for splitting, the first two will typically be used most often.

    Here I compare two small forests (with just ntree = 10): one that always offers all four variables for splitting (mtry = 4) and one that randomly selects two candidate variables in each node (mtry = 2).

    library("partykit")
    data("iris", package = "datasets")
    set.seed(0)
    cf4 <- cforest(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
      data = iris, ntree = 10, mtry = 4)
    set.seed(0)
    cf2 <- cforest(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
      data = iris, ntree = 10, mtry = 2)
    

    The first forest used only the petal variables for splitting, while the second used all four petal and sepal variables (even though the sepal variables have almost zero importance):

    set.seed(0)
    varimp(cf4)
    ## Petal.Length  Petal.Width 
    ##     6.256181     3.132640 
    set.seed(0)
    varimp(cf2)
    ## Petal.Length  Petal.Width Sepal.Length  Sepal.Width 
    ##   3.86894198   6.33520311  -0.05666548  -0.09665418 
    

    More insights: If you want to gain more insight into random forests, various packages can help you with this. One of them is our stablelearner package; see its Variable Selection and Cutpoint Analysis of Random Forests vignette.

    For example, the stabletree class can be used for certain summaries and visualizations. Probably the simplest display is a barplot of how often each variable is used for splitting.

    library("stablelearner")
    st4 <- as.stabletree(cf4)
    st2 <- as.stabletree(cf2)
    barplot(st4, main = "cf4: Variable selection frequencies")
    barplot(st2, main = "cf2: Variable selection frequencies")
    

    [Figure: stabletree barplots of variable selection frequencies for cf4 and cf2]
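
    If you prefer to stay within partykit, similar selection frequencies can be computed by walking the trees directly. This sketch assumes (as I believe is the case) that a cforest object stores its trees as partynode objects in $nodes and its model frame in $data, and it uses the documented partykit accessors is.terminal, kids_node, split_node, and varid_split:

    # count how often each model-frame column is used for splitting;
    # column 1 is the response and therefore stays at zero
    count_splits <- function(cf) {
      counts <- setNames(integer(ncol(cf$data)), names(cf$data))
      visit <- function(node) {
        if (is.terminal(node)) return(invisible(NULL))
        v <- varid_split(split_node(node))  # column id of the splitting variable
        counts[v] <<- counts[v] + 1L
        for (kid in kids_node(node)) visit(kid)
      }
      for (tree in cf$nodes) visit(tree)
      counts
    }
    count_splits(cf4)  # sepal counts should be zero, matching varimp(cf4)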