The varimp functions in the R packages party and partykit return variable importance vectors of different lengths, i.e., they return different numbers of variables: partykit yields fewer, while party returns one importance per variable supplied to the model. The party behaviour is what I expected. The partykit behaviour appears to truncate the variable list, but the documentation doesn't seem to explain why or how this is done, or (ideally) how to counteract it by forcing every variable to be returned with an importance value.
(To be clear, it's not differences in the order of the variable importances being considered here, only the quantity of variables returned by the importance function.)
MRE:
library(party)
library(partykit)
set.seed(303)
# generate some pseudo-data
# make empty data frame
df <- data.frame(matrix(ncol = 61, nrow = 100))
iVarNames <- paste("iVar", 1:60, sep="")
colnames(df) <- c("dVar", iVarNames)
# generate dependent variable
df['dVar'] <- runif(100, min=0, max=10)
# generate independent variables
for (icol in iVarNames) {
  df[icol] <- df$dVar + runif(1, 0.5, 1) * rnorm(100, sd = runif(1, 1, 5))
}
# make model formula
formVar <- reformulate(iVarNames, 'dVar')
# train CRF model in party
crfParty <- party::cforest(formVar, data = df,
  controls = party::cforest_unbiased(ntree = 300,
    mtry = as.integer(length(iVarNames) / 3)))
# get variable importance from party
impParty <- sort(party::varimp(crfParty), decreasing=TRUE)
# train CRF model in partykit
crfPartykit <- partykit::cforest(formVar, data = df,
  ntree = 300, mtry = as.integer(length(iVarNames) / 3))
# get variable importance from partykit
impPartykit <- sort(partykit::varimp(crfPartykit), decreasing=TRUE)
# compare variable importance outputs
length(impParty)
length(impPartykit)
Returns:
[1] 60
[1] 43
For context, I'm comparing these packages because I've found partykit (using parallel processing) to be faster than party at calculating conditional variable importance, although I've left those aspects out of this example to keep it an MRE.
TL;DR In partykit, only the variables that are actually used within the forest get variable importances. Variables that were never used for splitting don't receive any importance at all.
Illustration: To bring out the issue with a simpler example, I'm using the well-known iris data for Species classification, using the four available covariates Petal.Length, Petal.Width, Sepal.Length, and Sepal.Width. The former two discriminate between the species more easily, while the latter two are less suitable. Thus, when all four variables are offered for splitting, typically the first two will be used most often.
Here I'm comparing two small forests (with just ntree = 10) which either always offer all variables for splitting (mtry = 4) or which randomly select two variables for potential splitting in each node (mtry = 2).
library("partykit")
data("iris", package = "datasets")
set.seed(0)
cf4 <- cforest(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
  data = iris, ntree = 10, mtry = 4)
set.seed(0)
cf2 <- cforest(Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,
  data = iris, ntree = 10, mtry = 2)
The first forest only used the petal variables for splitting while the latter used all four petal and sepal variables (even though the sepal variables have almost zero importance):
set.seed(0)
varimp(cf4)
## Petal.Length Petal.Width
## 6.256181 3.132640
set.seed(0)
varimp(cf2)
## Petal.Length Petal.Width Sepal.Length Sepal.Width
## 3.86894198 6.33520311 -0.05666548 -0.09665418
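If you nevertheless need one value per input variable, one option is to pad the partykit result yourself. The sketch below rests on an assumption: a variable that was never used for splitting has a permutation importance of exactly zero, because permuting it cannot change any prediction. As far as I can tell this matches what party reports for such variables. Note that pad_varimp is a hypothetical helper, not part of partykit, and the importance values are simply copied from the cf4 output above so that the snippet runs without fitting a forest.

```r
# Hypothetical helper (not part of partykit): return one importance per
# variable in all_vars, filling in 0 for variables missing from imp.
# Assumption: a variable never used for splitting has importance 0.
pad_varimp <- function(imp, all_vars) {
  out <- setNames(numeric(length(all_vars)), all_vars)  # start with all zeros
  out[names(imp)] <- imp                                 # overwrite known values
  out
}

# Values copied from the varimp(cf4) output shown above
imp <- c(Petal.Length = 6.256181, Petal.Width = 3.132640)
all_vars <- c("Petal.Length", "Petal.Width", "Sepal.Length", "Sepal.Width")

padded <- pad_varimp(imp, all_vars)
print(padded)
```

For the forest in the question, the analogous call would be pad_varimp(partykit::varimp(crfPartykit), iVarNames), which should yield a vector of length 60 under the same assumption.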
More insights: If you want to gain more insights into random forests, there are also various packages that can help you with this. One of them is our stablelearner package; see the Variable Selection and Cutpoint Analysis of Random Forests vignette.
For example, the stabletree class can be used for certain summaries and visualizations. Probably the simplest display is a barplot of how often each variable is used for splitting.
library("stablelearner")
st4 <- as.stabletree(cf4)
st2 <- as.stabletree(cf2)
barplot(st4, main = "cf4: Variable selection frequencies")
barplot(st2, main = "cf2: Variable selection frequencies")