Tags: r, machine-learning, classification, decision-tree, rpart

rpart() decision tree fails to generate splits (the tree has only one node, the root)


I'm trying to create a decision tree to predict whether a given loan applicant would default or repay their debt.

I'm using the following dataset

library(readr)
library(dplyr)
library(rpart)
library(rpart.plot)

loans <- read_csv('https://assets.datacamp.com/production/repositories/718/datasets/7805fceacfb205470c0e8800d4ffc37c6944b30c/loans.csv')
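
To confirm how the columns were parsed (the question states that default comes in as dbl), a quick look with dplyr::glimpse() works; this check is not part of the original code, just a sketch:

glimpse(loans)   # default should appear as <dbl> with 0/1 values before the conversion below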

Since the response variable default is encoded as dbl, I convert it to chr first and then to a fct (factor) so I can use it in my classification model.

loans <- loans %>% mutate(default = factor(as.character(default), levels = c(0, 1), labels = c('repaid', 'defaulted')))
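
It can also help to look at how the two classes are balanced, since a heavily skewed outcome is one common reason rpart keeps no splits at the default settings. This is just base R, not from the original post:

table(loans$default)              # counts of repaid vs. defaulted
prop.table(table(loans$default))  # the same as proportions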

Now I build the recursive partitioning (rpart()) model, loans_model, with default as the response variable and loan_amount, credit_score, and debt_to_income as the explanatory variables.

loans_model <- rpart(default ~ loan_amount + credit_score + debt_to_income, data = loans, method = 'class')
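
To see why no split was kept, the fitted object itself can be inspected; printcp() prints the complexity table (a single row if only the root survives) and summary() lists the candidate splits that were considered. These calls are standard rpart functions, added here as a diagnostic sketch:

printcp(loans_model)   # complexity table; one row means the tree never split
summary(loans_model)   # candidate splits and their improvement at each node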

When I make predictions with this model, every observation gets the same predicted class, repaid.

loans$pred_default <- predict(loans_model, newdata = loans, type = "class")

unique(loans$pred_default)

Output:

[1] repaid
Levels: repaid defaulted
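
Cross-tabulating the predictions against the actual outcome makes the problem explicit: every row is predicted repaid regardless of its true class. This base R table() call is an added illustration, not part of the original question:

table(predicted = loans$pred_default, actual = loans$default)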

Also, when I try to visualize the decision tree, I get only one node (the root).

rpart.plot(loans_model)


Why does the model I built not make appropriate predictions?


Solution

  • You need to tinker with the cp argument (complexity parameter), which sets the minimum improvement in fit a split must achieve before rpart keeps it. The default is 0.01, and with this data no split on loan_amount, credit_score, or debt_to_income clears that threshold, so the tree stops at the root. If you set cp to -1 (so every candidate split is allowed) and limit the tree with maxdepth = 3, you get something more interesting, at least for a start.

    loans_model <- rpart(default ~ loan_amount + credit_score + debt_to_income, 
                         data = loans, 
                         method = 'class',
                         cp=-1,
                         maxdepth = 3)
    
    rpart.plot(loans_model, cex=0.7)
    


    On page 21 of the rpart vignette (longintro.pdf): "The default value (for cp) of .01 has been reasonably successful at ‘pre-pruning’ trees so that the cross-validation step need only remove 1 or 2 layers, but it sometimes over prunes, particularly for large data sets."
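
    Following that advice, one possible workflow (sketched here, not from the original answer; the object names loans_full, best_cp, and loans_pruned are illustrative) is to grow a deliberately large tree with a small cp, look at the cross-validated error in the cp table, and then cut the tree back with prune():

    # grow a large tree, then inspect cross-validated error per cp value
    loans_full <- rpart(default ~ loan_amount + credit_score + debt_to_income,
                        data = loans, method = 'class', cp = 0.001)
    printcp(loans_full)

    # prune back to the cp with the lowest cross-validated error (xerror)
    best_cp <- loans_full$cptable[which.min(loans_full$cptable[, "xerror"]), "CP"]
    loans_pruned <- prune(loans_full, cp = best_cp)
    rpart.plot(loans_pruned, cex = 0.7)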