Tags: r, dataframe, regression, mse

How can I calculate the mean square error in R of a regression tree?


I am working with the wine quality database.

I am fitting regression trees that use different sets of predictor variables:

library(rpart)
library(rpart.plot)
library(rattle)
library(naniar)
library(dplyr)
library(ggplot2)

vinos <- read.csv(file = 'Wine.csv', header = T)

arbol0<-rpart(formula=quality~chlorides, data=vinos, method="anova")
fancyRpartPlot(arbol0)

arbol1<-rpart(formula=quality~chlorides+density, data=vinos, method="anova")
fancyRpartPlot(arbol1)

I want to calculate the mean squared error to see whether arbol1 is better than arbol0. I will evaluate on the same dataset, since no other data is available. I tried the following:

aaa<-predict(object=arbol0, newdata=data.frame(chlorides=vinos$chlorides), type="anova")
bbb<-predict(object=arbol1, newdata=data.frame(chlorides=vinos$chlorides, density=vinos$density), type="anova")

and then manually subtract the last column of the data frame from aaa and bbb. However, I am getting an error. Can someone please help me?


Solution

  • This website could be useful to you. It is very important to split your dataset into training and test subsets before fitting your models, so that the error is measured on data the trees have not seen. In the following code I do the split with base R functions, but the sample.split function from the caTools package can do the same job. This website shows further ways to split data in R.
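
    As a rough sketch of the caTools alternative mentioned above: sample.split takes the response vector and a SplitRatio and returns a logical vector marking the training rows. The quality vector here is simulated purely for illustration, and the code falls back to a base R split if caTools is not installed.

    ```r
    set.seed(42)  # arbitrary seed so the split is reproducible

    # Simulated response, standing in for vinos$quality (illustration only)
    quality <- sample(3:8, 200, replace = TRUE)

    if (requireNamespace("caTools", quietly = TRUE)) {
      # TRUE marks rows assigned to the training subset
      in_train <- caTools::sample.split(quality, SplitRatio = 0.75)
    } else {
      # Fallback: base R split with the same 75% / 25% proportions
      train_rows <- sample(length(quality), size = floor(0.75 * length(quality)))
      in_train <- seq_along(quality) %in% train_rows
    }

    table(in_train)
    ```

    With a data frame, the subsets would then be taken as vinos[in_train, ] and vinos[!in_train, ].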

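    Remember that the Mean Squared Error (MSE) is defined as:

    $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

    where $y_i$ are the observed values, $\hat{y}_i$ are the predicted values, and $n$ is the number of observations.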

    So, it's very simple to apply in R. You just have to compute the mean of the squared differences between the observed values (i.e., the response variable from your test subset) and the predicted values (i.e., the values returned by the predict function for your model).
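
    As a minimal sketch of that computation, the calculation can be wrapped in a small helper. The name mse and the toy vectors are made up for illustration; they are not part of rpart or of the wine data.

    ```r
    # Hypothetical helper: mean of the squared differences
    # between observed and predicted values
    mse <- function(observed, predicted) {
      stopifnot(length(observed) == length(predicted))
      mean((observed - predicted)^2)
    }

    # Toy values, for illustration only
    observed  <- c(5, 6, 5, 7)
    predicted <- c(5.5, 5.5, 5.0, 6.0)

    mse(observed, predicted)
    ```

    For a fitted tree, the same helper would be called with the observed response of the test subset and the output of predict.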

    A solution for your wine dataset, based on the website mentioned above, could look like this:

    library(rpart)
    library(dplyr)
    library(data.table)
    
    vinos <- fread(file = 'Winequality-red.csv', header = TRUE)
    
    # Split data into train (75%) and test (25%) subsets
    set.seed(123)  # arbitrary seed so the random split is reproducible
    sample_index <- sample(nrow(vinos), size = floor(nrow(vinos) * 0.75))
    train <- vinos[sample_index, ]
    test <- vinos[-sample_index, ]
    
    # Train the regression tree models
    arbol0 <- rpart(formula = quality ~ chlorides, data = train, method = "anova")
    arbol1 <- rpart(formula = quality ~ chlorides + density, data = train, method = "anova")
    
    # Make predictions for each model
    pred0 <- predict(arbol0, newdata = test)
    pred1 <- predict(arbol1, newdata = test)
    
    # Calculate the test MSE for each model (the lower value indicates the better fit)
    mean((pred0 - test$quality)^2)
    mean((pred1 - test$quality)^2)
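
    Regarding the error in the question: as far as I know, predict for rpart objects only accepts "vector", "prob", "class" or "matrix" as type, so type = "anova" is rejected. Leaving type out (or using "vector") returns the numeric predictions of an anova tree. The sketch below runs on simulated data (the column values are invented, not the wine data) so that it works on its own, and it computes the in-sample MSE the question asked about. Keep in mind that an MSE computed on the training data is optimistic, which is why the train/test split above is preferable.

    ```r
    library(rpart)

    set.seed(123)  # arbitrary seed for reproducibility
    n <- 400

    # Simulated stand-in for the wine data (values are invented)
    vinos <- data.frame(
      chlorides = runif(n, 0.01, 0.6),
      density   = runif(n, 0.990, 1.004)
    )
    vinos$quality <- 5 + 3 * (vinos$chlorides < 0.2) +
      2 * (vinos$density < 0.996) + rnorm(n, sd = 0.5)

    arbol0 <- rpart(quality ~ chlorides, data = vinos, method = "anova")
    arbol1 <- rpart(quality ~ chlorides + density, data = vinos, method = "anova")

    # type = "vector" (the default for anova trees) gives numeric predictions
    pred0 <- predict(arbol0, newdata = vinos, type = "vector")
    pred1 <- predict(arbol1, newdata = vinos, type = "vector")

    # In-sample MSE of each tree
    mse0 <- mean((vinos$quality - pred0)^2)
    mse1 <- mean((vinos$quality - pred1)^2)
    c(arbol0 = mse0, arbol1 = mse1)
    ```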