I am using the caret package to fit different models to the same data, with cross-validation for all of them. However, when I use different numbers of folds with the lm method, I get exactly the same coefficients; I was expecting at least small differences. What is the reason? Is this expected?
Thanks for your time!
Here is a reprex:
library(caret)
#> Loading required package: ggplot2
#> Loading required package: lattice
{
set.seed(123)
Xs <- matrix(rnorm(300*20),nrow = 300)
Y <- rnorm(300)
data <- cbind(Xs,Y) |> as.data.frame()
}
ctrlspecs_2 <- trainControl(method="cv", number=2)
ctrlspecs_10 <- trainControl(method="cv", number=10)
set.seed(123)
model_2 <- train(Y~.,
data = data,
method = "lm",
trControl = ctrlspecs_2)
set.seed(123)
model_10 <- train(Y~.,
data = data,
method = "lm",
trControl = ctrlspecs_10)
summary(model_2)
#>
#> Call:
#> lm(formula = .outcome ~ ., data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.5934 -0.6277 -0.0082 0.7448 2.2594
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.044073 0.060499 -0.728 0.46692
#> V1 -0.129567 0.065772 -1.970 0.04984 *
#> V2 -0.002505 0.061859 -0.040 0.96773
#> V3 -0.046897 0.059486 -0.788 0.43115
#> V4 0.044195 0.061427 0.719 0.47245
#> V5 0.086981 0.064085 1.357 0.17579
#> V6 0.014166 0.061001 0.232 0.81653
#> V7 -0.077959 0.060911 -1.280 0.20165
#> V8 0.017661 0.065486 0.270 0.78759
#> V9 -0.096562 0.060567 -1.594 0.11200
#> V10 0.164024 0.060858 2.695 0.00746 **
#> V11 -0.028008 0.060869 -0.460 0.64577
#> V12 0.034027 0.062118 0.548 0.58428
#> V13 -0.066028 0.066681 -0.990 0.32294
#> V14 0.142444 0.061319 2.323 0.02090 *
#> V15 -0.129046 0.060109 -2.147 0.03267 *
#> V16 -0.020873 0.061512 -0.339 0.73462
#> V17 0.046835 0.063381 0.739 0.46056
#> V18 0.035570 0.066567 0.534 0.59353
#> V19 -0.016253 0.060039 -0.271 0.78682
#> V20 -0.082083 0.060843 -1.349 0.17840
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.033 on 279 degrees of freedom
#> Multiple R-squared: 0.1041, Adjusted R-squared: 0.03986
#> F-statistic: 1.621 on 20 and 279 DF, p-value: 0.04731
summary(model_10)
#>
#> Call:
#> lm(formula = .outcome ~ ., data = dat)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -3.5934 -0.6277 -0.0082 0.7448 2.2594
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -0.044073 0.060499 -0.728 0.46692
#> V1 -0.129567 0.065772 -1.970 0.04984 *
#> V2 -0.002505 0.061859 -0.040 0.96773
#> V3 -0.046897 0.059486 -0.788 0.43115
#> V4 0.044195 0.061427 0.719 0.47245
#> V5 0.086981 0.064085 1.357 0.17579
#> V6 0.014166 0.061001 0.232 0.81653
#> V7 -0.077959 0.060911 -1.280 0.20165
#> V8 0.017661 0.065486 0.270 0.78759
#> V9 -0.096562 0.060567 -1.594 0.11200
#> V10 0.164024 0.060858 2.695 0.00746 **
#> V11 -0.028008 0.060869 -0.460 0.64577
#> V12 0.034027 0.062118 0.548 0.58428
#> V13 -0.066028 0.066681 -0.990 0.32294
#> V14 0.142444 0.061319 2.323 0.02090 *
#> V15 -0.129046 0.060109 -2.147 0.03267 *
#> V16 -0.020873 0.061512 -0.339 0.73462
#> V17 0.046835 0.063381 0.739 0.46056
#> V18 0.035570 0.066567 0.534 0.59353
#> V19 -0.016253 0.060039 -0.271 0.78682
#> V20 -0.082083 0.060843 -1.349 0.17840
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1.033 on 279 degrees of freedom
#> Multiple R-squared: 0.1041, Adjusted R-squared: 0.03986
#> F-statistic: 1.621 on 20 and 279 DF, p-value: 0.04731
identical(model_2$finalModel$coefficients,model_10$finalModel$coefficients)
#> [1] TRUE
Created on 2024-03-20 with reprex v2.1.0
The coefficients are the same because summary is giving you the results of the final linear model, which caret refits on the entire dataset once resampling is finished. The number of folds has no effect on that fit.
The cross-validation is done separately to estimate how well the model will perform on unseen data. You can see the per-fold cross-validation results using model_2$resample and model_10$resample.
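To check this yourself, here is a minimal sketch that reuses the data and model names from the reprex. The name full_fit is introduced here purely for illustration. It compares the final model's coefficients with a plain lm() fit on all rows, then prints the resampling results, which is where the number of folds actually shows up.

```r
library(caret)

# Same simulated data as in the reprex
set.seed(123)
Xs <- matrix(rnorm(300 * 20), nrow = 300)
Y <- rnorm(300)
data <- as.data.frame(cbind(Xs, Y))

set.seed(123)
model_2 <- train(Y ~ ., data = data, method = "lm",
                 trControl = trainControl(method = "cv", number = 2))
set.seed(123)
model_10 <- train(Y ~ ., data = data, method = "lm",
                  trControl = trainControl(method = "cv", number = 10))

# Plain lm() on the full data (full_fit is a name made up for this sketch).
# The final model inside train() should agree with it up to numerical tolerance.
full_fit <- lm(Y ~ ., data = data)
all.equal(unname(coef(model_2$finalModel)), unname(coef(full_fit)))

# Per-fold performance: one row per held-out fold, so 2 rows vs 10 rows
model_2$resample
model_10$resample

# Performance averaged over the folds; these estimates differ between the two
model_2$results
model_10$results
```

With method = "lm" there is nothing to tune, so cross-validation only changes the performance estimates (RMSE, Rsquared, MAE), not the coefficients of the model that is returned.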