r cross-validation lm caret

Why do I get the same results with different cross-validation specifications in caret for `lm`?


I am using the caret package to fit different models to the same data, with cross-validation for all of them. However, when I use different numbers of folds with the lm method, I get identical coefficients. I was expecting at least small differences. What is the reason? Is this expected?

Thanks for your time!

Here is a reprex:

library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice

{
set.seed(123)
Xs <- matrix(rnorm(300*20),nrow = 300)
Y <- rnorm(300)
data <- cbind(Xs,Y) |> as.data.frame()
}

ctrlspecs_2 <- trainControl(method="cv", number=2)
ctrlspecs_10 <- trainControl(method="cv", number=10)

set.seed(123)
model_2 <- train(Y~.,
                 data = data,
                 method = "lm",
                 trControl = ctrlspecs_2)

set.seed(123)
model_10 <- train(Y~.,
                 data = data,
                 method = "lm",
                 trControl = ctrlspecs_10)

summary(model_2)
#> 
#> Call:
#> lm(formula = .outcome ~ ., data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.5934 -0.6277 -0.0082  0.7448  2.2594 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)   
#> (Intercept) -0.044073   0.060499  -0.728  0.46692   
#> V1          -0.129567   0.065772  -1.970  0.04984 * 
#> V2          -0.002505   0.061859  -0.040  0.96773   
#> V3          -0.046897   0.059486  -0.788  0.43115   
#> V4           0.044195   0.061427   0.719  0.47245   
#> V5           0.086981   0.064085   1.357  0.17579   
#> V6           0.014166   0.061001   0.232  0.81653   
#> V7          -0.077959   0.060911  -1.280  0.20165   
#> V8           0.017661   0.065486   0.270  0.78759   
#> V9          -0.096562   0.060567  -1.594  0.11200   
#> V10          0.164024   0.060858   2.695  0.00746 **
#> V11         -0.028008   0.060869  -0.460  0.64577   
#> V12          0.034027   0.062118   0.548  0.58428   
#> V13         -0.066028   0.066681  -0.990  0.32294   
#> V14          0.142444   0.061319   2.323  0.02090 * 
#> V15         -0.129046   0.060109  -2.147  0.03267 * 
#> V16         -0.020873   0.061512  -0.339  0.73462   
#> V17          0.046835   0.063381   0.739  0.46056   
#> V18          0.035570   0.066567   0.534  0.59353   
#> V19         -0.016253   0.060039  -0.271  0.78682   
#> V20         -0.082083   0.060843  -1.349  0.17840   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.033 on 279 degrees of freedom
#> Multiple R-squared:  0.1041, Adjusted R-squared:  0.03986 
#> F-statistic: 1.621 on 20 and 279 DF,  p-value: 0.04731
summary(model_10)
#> 
#> Call:
#> lm(formula = .outcome ~ ., data = dat)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -3.5934 -0.6277 -0.0082  0.7448  2.2594 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)   
#> (Intercept) -0.044073   0.060499  -0.728  0.46692   
#> V1          -0.129567   0.065772  -1.970  0.04984 * 
#> V2          -0.002505   0.061859  -0.040  0.96773   
#> V3          -0.046897   0.059486  -0.788  0.43115   
#> V4           0.044195   0.061427   0.719  0.47245   
#> V5           0.086981   0.064085   1.357  0.17579   
#> V6           0.014166   0.061001   0.232  0.81653   
#> V7          -0.077959   0.060911  -1.280  0.20165   
#> V8           0.017661   0.065486   0.270  0.78759   
#> V9          -0.096562   0.060567  -1.594  0.11200   
#> V10          0.164024   0.060858   2.695  0.00746 **
#> V11         -0.028008   0.060869  -0.460  0.64577   
#> V12          0.034027   0.062118   0.548  0.58428   
#> V13         -0.066028   0.066681  -0.990  0.32294   
#> V14          0.142444   0.061319   2.323  0.02090 * 
#> V15         -0.129046   0.060109  -2.147  0.03267 * 
#> V16         -0.020873   0.061512  -0.339  0.73462   
#> V17          0.046835   0.063381   0.739  0.46056   
#> V18          0.035570   0.066567   0.534  0.59353   
#> V19         -0.016253   0.060039  -0.271  0.78682   
#> V20         -0.082083   0.060843  -1.349  0.17840   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 1.033 on 279 degrees of freedom
#> Multiple R-squared:  0.1041, Adjusted R-squared:  0.03986 
#> F-statistic: 1.621 on 20 and 279 DF,  p-value: 0.04731

identical(model_2$finalModel$coefficients,model_10$finalModel$coefficients)
#> [1] TRUE

Created on 2024-03-20 with reprex v2.1.0


Solution

  • The coefficients are the same because summary() reports the final linear model, which caret fits on the entire dataset once resampling is finished. The number of folds has no effect on that fit.

    The cross-validation is done separately, to estimate how well the model will perform on unseen data. You can see the per-fold results with model_2$resample and model_10$resample, which have one row per fold.
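To make the separation concrete, here is a minimal base-R sketch of the same idea, written without caret. The helper cv_rmse is hypothetical (it is not a caret function), and the fold assignment is a plain random split, which is an assumption and may not match how caret builds its folds. It only illustrates that the folds feed the error estimate, while the final coefficients come from a single fit on all rows.

```r
set.seed(123)
n <- 300
dat <- as.data.frame(matrix(rnorm(n * 20), nrow = n))  # columns V1..V20
dat$Y <- rnorm(n)

# Hypothetical helper (not part of caret): per-fold RMSE for lm with k folds.
cv_rmse <- function(data, k) {
  # Randomly assign each row to one of k folds (simple split, an assumption).
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  sapply(seq_len(k), function(i) {
    fit  <- lm(Y ~ ., data = data[fold != i, ])         # fit on k - 1 folds
    pred <- predict(fit, newdata = data[fold == i, ])   # predict held-out fold
    sqrt(mean((data$Y[fold == i] - pred)^2))
  })
}

rmse_2  <- cv_rmse(dat, 2)    # 2 held-out error estimates
rmse_10 <- cv_rmse(dat, 10)   # 10 held-out error estimates

# The final model never sees the folds: it is one fit on all rows,
# so its coefficients cannot depend on k.
final_fit <- lm(Y ~ ., data = dat)
coef(final_fit)
```

In the reprex above, the corresponding caret objects are model_2$resample (per-fold metrics), model_2$results (metrics averaged over folds) and model_2$finalModel (the single lm fit on all the data). Only the first two change with the number of folds.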