I'm trying to create a decision tree to predict whether a given loan applicant would default or repay their debt.
I'm using the following dataset:
library(readr)
library(dplyr)
library(rpart)
library(rpart.plot)
loans <- read_csv('https://assets.datacamp.com/production/repositories/718/datasets/7805fceacfb205470c0e8800d4ffc37c6944b30c/loans.csv')
Since the response variable default is encoded as dbl, I first convert it to chr and then to a fct (factor) so that I can use it in my classification model.
loans <- loans %>% mutate(default = factor(as.character(default), levels = c(0, 1), labels = c('repaid', 'defaulted')))
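As a quick sanity check (purely illustrative, using base R functions), I can confirm that the conversion produced a two-level factor:
levels(loans$default)  # should show "repaid" "defaulted"
table(loans$default)   # class counts for each level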
Now, I start building the recursive partitioning (rpart()) object, loans_model: the response variable is default and the explanatory variables are loan_amount, credit_score, and debt_to_income.
loans_model <- rpart(default ~ loan_amount + credit_score + debt_to_income, data = loans, method = 'class')
When I make predictions with this model, every observation gets the same predicted value, repaid.
loans$pred_default <- predict(loans_model, newdata = loans, type = "class")
unique(loans$pred_default)
Output:
[1] repaid
Levels: repaid defaulted
Also when I try to visualize the decision tree, I get only one node (the root).
rpart.plot(loans_model)
Why does the model I built not make appropriate predictions?
You need to tinker with the cp argument (the complexity parameter), which controls how much a split must improve the fit before it is attempted. The default is 0.01. If you set it to -1 and set the maxdepth argument to 3, then you get something more interesting, at least for a start.
loans_model <- rpart(default ~ loan_amount + credit_score + debt_to_income,
data = loans,
method = 'class',
cp = -1,
maxdepth = 3)
rpart.plot(loans_model, cex=0.7)
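To confirm the refit model no longer predicts a single class, you could tabulate its predictions against the observed values (a minimal sketch; reusing the pred_default column name from the question):
loans$pred_default <- predict(loans_model, newdata = loans, type = "class")
table(predicted = loans$pred_default, actual = loans$default)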
From page 21 of the rpart vignette (longintro.pdf): "The default value (for cp) of .01 has been reasonably successful at ‘pre-pruning’ trees so that the cross-validation step need only remove 1 or 2 layers, but it sometimes over prunes, particularly for large data sets."
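In that spirit, rather than forcing cp to be negative, a more standard workflow is to grow a large tree with a small cp and then prune it back using the cross-validated error stored in the model's cptable. A minimal sketch, assuming the same loans data and formula (full_model, best_cp, and pruned_model are just illustrative names):
full_model <- rpart(default ~ loan_amount + credit_score + debt_to_income,
                    data = loans,
                    method = 'class',
                    cp = 0)                # grow a deliberately large tree
printcp(full_model)                        # cross-validated error for each subtree size
# pick the cp with the lowest cross-validated error and prune back to it
best_cp <- full_model$cptable[which.min(full_model$cptable[, "xerror"]), "CP"]
pruned_model <- prune(full_model, cp = best_cp)
rpart.plot(pruned_model, cex = 0.7)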