This example code creates a dataframe with the initial column representing the majority vote from the 10 trees. The next 10 columns contain the categorical vote of each tree in the model. I want to create a chart to show the distribution of votes for each row. What is the best way to do that?
library(tidyverse)
library(caret)
library(randomForest)
train_index_cars <- as.vector(createDataPartition(mtcars[['cyl']],p=.8,list=FALSE,times=1))
mytrain <- mtcars[train_index_cars, ]
mytest <- mtcars[-train_index_cars, ]
car_forest <- randomForest(factor(cyl) ~ ., data = mytrain, ntree = 10)
cartest_predicted <- as.data.frame(predict(car_forest, newdata = mytest, predict.all = TRUE))
The output of the Merc 280 row in cartest_predicted looks something like this (excluding the last 6 trees):
id | aggregate | individual.1 | individual.2 | individual.3 | individual.4 |
---|---|---|---|---|---|
Merc 280 | 6 | 6 | 8 | 6 | 4 |
I'd like to add three columns to each row that contain the count of votes for each category (4, 6, 8) across trees. I'm envisioning the output like this:
individual.10 | Votes_4 | Votes_6 | Votes_8 |
---|---|---|---|
6 | 2 | 7 | 1 |
What is the best way to sum the columns across rows by condition? I can't seem to find exactly what I need out there. Does this output already exist as part of the randomForest package and I'm just overlooking it?
This should work:
# Columns holding the per-tree votes, fixed before the loop so the new Votes_* columns are never counted.
vote_cols = grep("^individual", colnames(cartest_predicted))
# Temporary function, to be passed to apply(); i is looked up when the function is called.
temp.fun = function(x) sum(x == i)
for (i in levels(cartest_predicted$aggregate)) # Iterating over all possible votes.
{
  cartest_predicted$temp = apply(cartest_predicted[, vote_cols], MARGIN = 1, temp.fun) # Requested results.
  colnames(cartest_predicted)[ncol(cartest_predicted)] = paste("Votes", i, sep = "_") # Renaming new column.
}
The for loop is needed to iterate over all the possible votes that trees can cast. A temporary function is created to perform the operation you need for each i, that is, for each possible vote. This function is then passed to apply(), which applies it to each row of cartest_predicted (notice MARGIN = 1). Finally, paste is used to rename the new column.
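If you prefer to avoid the explicit loop, the same counts can probably be obtained by comparing the vote matrix against each class and summing across rows. The sketch below is only an illustration: votes_df is a small hand-made stand-in for cartest_predicted (its rows and votes are made up, not real model output), and it assumes the per-tree columns are named individual.1, individual.2, and so on, as in the question.

```r
# Hypothetical stand-in for cartest_predicted: one aggregate column plus per-tree votes.
votes_df <- data.frame(
  aggregate    = factor(c("6", "8", "4"), levels = c("4", "6", "8")),
  individual.1 = c("6", "8", "4"),
  individual.2 = c("8", "8", "4"),
  individual.3 = c("6", "8", "6"),
  individual.4 = c("4", "6", "4"),
  row.names    = c("Merc 280", "Merc 450SL", "Fiat 128"),
  stringsAsFactors = FALSE
)

# Per-tree votes as a character matrix (rows = cars, columns = trees).
vote_mat <- as.matrix(votes_df[, grep("^individual", colnames(votes_df))])

# For each class, count how many trees voted for it in every row.
classes <- levels(votes_df$aggregate)
vote_counts <- sapply(classes, function(cl) rowSums(vote_mat == cl))
colnames(vote_counts) <- paste("Votes", classes, sep = "_")

# Attach the counts to the original data frame.
votes_df <- cbind(votes_df, vote_counts)
print(votes_df)
```

Using levels() rather than unique() also gives a column for any class that never wins the majority vote. As for whether the package already offers this: the randomForest documentation describes type = "vote" together with norm.votes = FALSE for predict(), which may return these counts directly; that route is not tested here.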