r, random, cross-validation, training-data, test-data

Split data into 5 subsets with choose(n, k) & NOT with sample()


I want to split my data into train and test sets in R, but using the choose() function rather than sample().

My dataset (a CSV file) has 58 rows and 28 columns, and I want to do 5-fold or 10-fold CV on it.

How should I write the code for this task?

I've tried:

    set.seed(1)
    smp_size = choose(58, 5, name_dataset) # which is totally wrong, but ...
    # I haven't figured out yet how to take 5 subsets from 58 observations
    # each time I do a 5/10-fold CV
    
    train_ind = sample(seq_len(nrow(name_dataset)), size = smp_size) # I think sample here is wrong too
    train = name_dataset[train_ind, ]
    test = name_dataset[-train_ind, ]

Solution

  • I don't know what you mean by every possible combination of 5 subsets; that would be an incredibly large number of possibilities (and choose() only counts combinations, it doesn't produce them). I assume you mean that you want a split into 5 subsets that together contain all of the samples in your dataset. I would probably do something like this: first make a vector of group labels from 1 to k with the same length as the dataset, then shuffle the labels with sample() and split the dataset by these groupings.

    library(tidyverse)
    
    set.seed(3465)
    # Example data with the same number of rows as in the question (58)
    test_data <- tibble(A = runif(58),
                        B = runif(58))
    
    # Split `dat` into k roughly equal, randomly assigned subsets
    k_split <- function(dat, k, seed = 1){
      set.seed(seed)
      # group labels 1..k, recycled to the number of rows
      grp <- rep(1:k, length.out = nrow(dat))
      dat |>
        # shuffle the labels so group membership is random
        mutate(grp = sample(grp, nrow(dat), replace = FALSE)) |>
        # split into a list of k tibbles and drop the helper column
        group_split(grp) |>
        map(\(d) select(d, -grp))
    }
    
    k_split(test_data, 5)
    #> [[1]]
    #> # A tibble: 12 x 2
    #>        A      B
    #>    <dbl>  <dbl>
    #>  1 0.476 0.468 
    #>  2 0.636 0.639 
    #>  3 0.334 0.0269
    #>  4 0.668 0.220 
    #>  5 0.398 0.919 
    #>  6 0.343 0.748 
    #>  7 0.799 0.526 
    #>  8 0.710 0.759 
    #>  9 0.737 0.927 
    #> 10 0.819 0.441 
    #> 11 0.852 0.656 
    #> 12 0.416 0.541 
    #> 
    #> [[2]]
    #> # A tibble: 12 x 2
    #>         A      B
    #>     <dbl>  <dbl>
    #>  1 0.0107 0.905 
    #>  2 0.109  0.539 
    #>  3 0.715  0.778 
    #>  4 0.523  0.416 
    #>  5 0.609  0.357 
    #>  6 0.152  0.0972
    #>  7 0.919  0.450 
    #>  8 0.866  0.510 
    #>  9 0.0347 0.0890
    #> 10 0.862  0.465 
    #> 11 0.364  0.765 
    #> 12 0.789  0.601 
    #> 
    #> [[3]]
    #> # A tibble: 12 x 2
    #>         A      B
    #>     <dbl>  <dbl>
    #>  1 0.580  0.228 
    #>  2 0.201  0.0418
    #>  3 0.0359 0.417 
    #>  4 0.521  0.758 
    #>  5 0.534  0.974 
    #>  6 0.580  0.563 
    #>  7 0.844  0.781 
    #>  8 0.756  0.271 
    #>  9 0.211  0.533 
    #> 10 0.851  0.764 
    #> 11 0.885  0.150 
    #> 12 0.262  0.371 
    #> 
    #> [[4]]
    #> # A tibble: 11 x 2
    #>         A     B
    #>     <dbl> <dbl>
    #>  1 0.556  0.313
    #>  2 0.353  0.821
    #>  3 0.0959 0.861
    #>  4 0.759  0.261
    #>  5 0.207  0.772
    #>  6 0.668  0.527
    #>  7 0.150  0.788
    #>  8 0.0939 0.257
    #>  9 0.0913 0.817
    #> 10 0.294  0.790
    #> 11 0.0224 0.253
    #> 
    #> [[5]]
    #> # A tibble: 11 x 2
    #>          A      B
    #>      <dbl>  <dbl>
    #>  1 0.0893  0.665 
    #>  2 0.966   0.142 
    #>  3 0.672   0.0849
    #>  4 0.641   0.155 
    #>  5 0.490   0.187 
    #>  6 0.00394 0.295 
    #>  7 0.126   0.813 
    #>  8 0.202   0.474 
    #>  9 0.0740  0.107 
    #> 10 0.412   0.709 
    #> 11 0.509   0.253
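
To actually run the cross-validation, each fold takes a turn as the test set while the remaining folds are bound back together as the training set. Below is a minimal sketch of that loop (not part of the original answer): `name_dataset`, the file name, and the outcome column `Y` are placeholders for your own 58 x 28 CSV data and model formula.

    library(tidyverse)
    
    # Placeholder: read your own 58 x 28 CSV here
    name_dataset <- read_csv("your_file.csv")
    folds <- k_split(name_dataset, k = 5)
    
    cv_mse <- map_dbl(seq_along(folds), function(i) {
      test  <- folds[[i]]              # the i-th fold is the test set
      train <- bind_rows(folds[-i])    # the remaining folds form the training set
      fit   <- lm(Y ~ ., data = train) # placeholder model; swap in your own
      mean((predict(fit, newdata = test) - test$Y)^2)  # test MSE for this fold
    })
    
    mean(cv_mse)  # average cross-validation error over the 5 folds

Calling k_split(name_dataset, k = 10) and re-running the same loop gives 10-fold CV instead.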