r list random subsampling

Randomly divide df in list of df into equal subsets


Yesterday I asked a similar question: R - Randomly split a dataframe in n equal pieces

The answer I got is nearly what I need, but there are still problems with it. I have also thought about other ways to get the result.

This is my example df-list:

set.seed(0L)
AB_df = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_df = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))

AB_pc = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_pc = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))

df_list = list(AB_df, BC_df, DE_df, FG_df, AB_pc, BC_pc, DE_pc, FG_pc)
names(df_list) = c("AB_df", "BC_df", "DE_df", "FG_df", "AB_pc", "BC_pc", "DE_pc", "FG_pc")

I want to randomly split each data frame in the list into n equal parts (or as close to equal as possible). I already got a very helpful answer from chinsoon12:

new = lapply(df_list, function(df) {
  n <- nrow(df)
  split(df, cut(sample(n), seq(1, n, by=floor(n/4)), labels=FALSE, include.lowest=TRUE))})

The problem is that it does not work for every number of rows, and not all observations are taken into account. For example, when I divide my df_list into 5 subsets with that method, I get subsets of 325, 324, 324, 324, 324 rows for AB_df. In total that is 1621, not 1624, so some rows are missing. When I divide it into 4 pieces, I only get 3 subsets. Any idea why this is happening?
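The numbers above can be reproduced from the break points alone. The following is a minimal sketch that uses only the row count of AB_df (1624) and ranks 1..n in place of `sample(n)`; it is meant to show where the rows go missing, not to fix the code.

```r
# Break points produced by seq(1, n, by = floor(n / k)) for n = 1624 rows
n <- 1624

breaks_4 <- seq(1, n, by = floor(n / 4))
breaks_4
# 1 407 813 1219: four break points give only three intervals,
# so every rank above 1219 is outside all of them

breaks_5 <- seq(1, n, by = floor(n / 5))
breaks_5
# 1 325 649 973 1297 1621: five intervals, but ranks 1622..1624 are outside

# cut() returns NA for values outside the breaks, and split() drops NA groups
grp <- cut(seq_len(n), breaks_5, labels = FALSE, include.lowest = TRUE)
table(grp)        # 325 324 324 324 324
sum(is.na(grp))   # 3 rows that end up in no group
```

Because the sequence of breaks stops before it reaches n, the last break is not guaranteed to be n, and any rank beyond it is silently discarded.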

I also thought about two other ways of splitting the data frames in the list. One would be to simply shuffle the rows of each data frame:

for (a in 1:length(df_list)) {
  df_list[[a]] = df_list[[a]][sample(nrow(df_list[[a]])),]}

Now I would only need to divide the data frames into n pieces, but this is the point where I am not sure how to do that.
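One possible way to finish this second idea is to cut the shuffled rows into consecutive blocks. This is a rough sketch on toy data; `df`, `n_sub`, `block` and `pieces` are names made up for the illustration and do not come from the question.

```r
set.seed(1)
df <- data.frame(x = 1:10, y = letters[1:10])  # toy data, not from the question
n_sub <- 3                                     # hypothetical number of pieces

# Shuffle the rows, as in the loop above
shuffled <- df[sample(nrow(df)), ]

# Consecutive block labels 1,1,..,2,2,..,3,3,..; sizes differ by at most one row
block <- sort(rep_len(seq_len(n_sub), nrow(shuffled)))
pieces <- split(shuffled, block)

sapply(pieces, nrow)
```

Since `rep_len` always produces exactly `nrow(shuffled)` labels, every row lands in some piece, whatever the row count is.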

A third way I thought of would be to create a random vector of the numbers 1:n for n subsamples, add it to each data frame, and then extract the subsets according to that number.

I still think the first way is the easiest and I would prefer it. Any idea what's wrong with the code?
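As an illustration of this third idea, the sketch below adds a random group column and splits on it. It runs on toy data, and the column name `grp` as well as `df`, `n_sub` and `pieces` are hypothetical names chosen for the example.

```r
set.seed(1)
df <- data.frame(x = 1:10, y = letters[1:10])  # toy data, not from the question
n_sub <- 3                                     # hypothetical number of subsamples

# Labels 1..n_sub repeated to exactly nrow(df), then put in random order
df$grp <- sample(rep_len(seq_len(n_sub), nrow(df)))

# Extract one data frame per label, leaving the helper column out
pieces <- split(df[c("x", "y")], df$grp)

sapply(pieces, nrow)
```

The group sizes are fixed by `rep_len` (here 4, 3 and 3); only the assignment of rows to groups is random.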


Solution

  • The different group sizes come from how cut works: it always needs a hard interval border on one side, and I don't really know how to set that up in your case. You can solve the problem with gl instead; just ignore the warnings. If you randomize the generated levels before you apply them, you get the random split you are after.

    set.seed(0L)
    AB_df = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
    BC_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
    DE_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
    FG_df = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
    
    AB_pc = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
    BC_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
    DE_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
    FG_pc = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
    
    df_list = list(AB_df, BC_df, DE_df, FG_df, AB_pc, BC_pc, DE_pc, FG_pc)
    names(df_list) = c("AB_df", "BC_df", "DE_df", "FG_df", "AB_pc", "BC_pc", "DE_pc", "FG_pc")
    
    #the number of groups you want to generate
    subs <- 4
    
    splittedList <-  lapply(df_list,
                            function(df){
                              idx <- gl(n = subs,round(nrow(df)/subs))
                              split(df, sample(idx))# randomize the groups
                            })
    #> Warning in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...):
    #> data length is not a multiple of split variable
    
    #> Warning in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...):
    #> data length is not a multiple of split variable
    
    ## the groups are approximately equally sized:
    lapply(splittedList,function(l){sapply(l,nrow)})
    #> $AB_df
    #>   1   2   3   4 
    #> 406 406 406 406 
    #> 
    #> $BC_df
    #>   1   2   3   4 
    #> 414 414 414 414 
    #> 
    #> $DE_df
    #>   1   2   3   4 
    #> 414 414 414 414 
    #> 
    #> $FG_df
    #>   1   2   3   4 
    #> 432 432 433 432 
    #> 
    #> $AB_pc
    #>   1   2   3   4 
    #> 406 406 406 406 
    #> 
    #> $BC_pc
    #>   1   2   3   4 
    #> 414 414 414 414 
    #> 
    #> $DE_pc
    #>   1   2   3   4 
    #> 414 414 414 414 
    #> 
    #> $FG_pc
    #>   1   2   3   4 
    #> 432 432 433 432
    
    ## and the sizes are right:
    sapply(df_list,nrow)
    #> AB_df BC_df DE_df FG_df AB_pc BC_pc DE_pc FG_pc 
    #>  1624  1656  1656  1729  1624  1656  1656  1729
    
    sapply(splittedList,function(l){sum(sapply(l,nrow))})
    #> AB_df BC_df DE_df FG_df AB_pc BC_pc DE_pc FG_pc 
    #>  1624  1656  1656  1729  1624  1656  1656  1729
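If the warnings are unwanted, one variation is to ask `gl` for exactly `nrow(df)` labels through its `length` argument, so that `split` never has to recycle the grouping factor. This is a sketch on a small toy list; `toy_list` and `split_list` are hypothetical names, and the approach is a variant of the code above rather than a different method.

```r
set.seed(0L)
# Toy list whose row counts are not multiples of subs (hypothetical data)
toy_list <- list(a = data.frame(x = 1:10), b = data.frame(x = 1:13))

# the number of groups you want to generate
subs <- 4

split_list <- lapply(toy_list, function(df) {
  # Levels 1..subs cycled until exactly nrow(df) labels exist
  idx <- gl(n = subs, k = 1, length = nrow(df))
  split(df, sample(idx))  # randomize which row gets which label
})

# Group sizes differ by at most one row, and every row is used
lapply(split_list, function(l) sapply(l, nrow))
sapply(split_list, function(l) sum(sapply(l, nrow)))
```

With 10 rows and 4 groups this gives sizes 3, 3, 2, 2, and with 13 rows it gives 4, 3, 3, 3.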