Yesterday I asked a similar question: R - Randomly split a dataframe in n equal pieces
The answer I got is nearly what I need, but there are still problems with it. I have also thought about other ways to get the result.
This is my example list of data frames:
set.seed(0L)
AB_df = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_df = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
AB_pc = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_pc = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
df_list = list(AB_df, BC_df, DE_df, FG_df, AB_pc, BC_pc, DE_pc, FG_pc)
names(df_list) = c("AB_df", "BC_df", "DE_df", "FG_df", "AB_pc", "BC_pc", "DE_pc", "FG_pc")
I want to randomly split each data frame in the list into n equal parts (or as close to equal as possible). I already got a very helpful answer from chinsoon12:
new = lapply(df_list, function(df) {
  n <- nrow(df)
  split(df, cut(sample(n), seq(1, n, by=floor(n/4)), labels=FALSE, include.lowest=TRUE))
})
The problem is that it does not work for every number of rows, and not all observations are taken into account. E.g. when I divide my df_list into 5 subsets with that method, I get subsets of 325, 324, 324, 324, 324 for AB_df, which adds up to 1621 rather than 1624, so three rows are missing. When I divide it into 4 pieces, I only get 3 subsets. Any idea why this is happening?
I also thought about two other ways of splitting the data frames in the list. One would be to simply shuffle the rows of each data frame:
for (a in seq_along(df_list)) {
  # shuffle the rows of each data frame in place
  df_list[[a]] = df_list[[a]][sample(nrow(df_list[[a]])), ]
}
Now I would only need to divide the data frames into n pieces, but this is the point where I am not sure how to do that.
A third way I thought of would be to create a random vector of group numbers 1:n for n subsamples, add it to each data frame, and then extract the subsets by group number.
I still think the first way is the easiest and I would prefer it. Any idea what's wrong with the code?
The problem behind your different group sizes comes from cut. It always needs hard interval borders, and I don't really know how to set them up for your case.
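To see where the rows go, here is a small check that rebuilds the breaks from the question for n = 1624 (the row count of AB_df). The helper name lost_rows is made up for this illustration, and seq_len(n) stands in for sample(n), which is fine for counting because a permutation contains the same values.

```r
# Hypothetical helper: how many groups do the breaks define,
# and how many row indices fall outside all of them?
lost_rows <- function(n, k) {
  breaks <- seq(1, n, by = floor(n / k))
  grp <- cut(seq_len(n), breaks, labels = FALSE, include.lowest = TRUE)
  c(groups = length(breaks) - 1, lost = sum(is.na(grp)))
}

lost_rows(1624, 5)  # breaks stop at 1621: 5 groups, 3 rows without a group
lost_rows(1624, 4)  # breaks are 1, 407, 813, 1219: 3 groups, 405 rows without a group
```

Every index above the last break gets NA from cut, and split silently drops those rows. That would explain both the missing three rows and the missing fourth subset.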
You could solve your problem with gl; just ignore the warnings.
If you randomize the generated levels before you apply them, you're there.
set.seed(0L)
AB_df = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_df = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_df = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
AB_pc = data.frame(replicate(2,sample(0:130,1624,rep=TRUE)))
BC_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
DE_pc = data.frame(replicate(2,sample(0:130,1656,rep=TRUE)))
FG_pc = data.frame(replicate(2,sample(0:130,1729,rep=TRUE)))
df_list = list(AB_df, BC_df, DE_df, FG_df, AB_pc, BC_pc, DE_pc, FG_pc)
names(df_list) = c("AB_df", "BC_df", "DE_df", "FG_df", "AB_pc", "BC_pc", "DE_pc", "FG_pc")
# the number of groups you want to generate
subs <- 4
splittedList <- lapply(df_list, function(df) {
  idx <- gl(n = subs, k = round(nrow(df) / subs))
  split(df, sample(idx))  # randomize the groups
})
#> Warning in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...):
#> data length is not a multiple of split variable
#> Warning in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...):
#> data length is not a multiple of split variable
## the groups are approximately equally sized:
lapply(splittedList, function(l) sapply(l, nrow))
#> $AB_df
#> 1 2 3 4
#> 406 406 406 406
#>
#> $BC_df
#> 1 2 3 4
#> 414 414 414 414
#>
#> $DE_df
#> 1 2 3 4
#> 414 414 414 414
#>
#> $FG_df
#> 1 2 3 4
#> 432 432 433 432
#>
#> $AB_pc
#> 1 2 3 4
#> 406 406 406 406
#>
#> $BC_pc
#> 1 2 3 4
#> 414 414 414 414
#>
#> $DE_pc
#> 1 2 3 4
#> 414 414 414 414
#>
#> $FG_pc
#> 1 2 3 4
#> 432 432 433 432
## and the sizes are right:
sapply(df_list,nrow)
#> AB_df BC_df DE_df FG_df AB_pc BC_pc DE_pc FG_pc
#> 1624 1656 1656 1729 1624 1656 1656 1729
sapply(splittedList,function(l){sum(sapply(l,nrow))})
#> AB_df BC_df DE_df FG_df AB_pc BC_pc DE_pc FG_pc
#> 1624 1656 1656 1729 1624 1656 1656 1729
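The warnings appear because gl(4, round(1729/4)) yields 1728 labels for the 1729 rows of FG_df, so split has to recycle the factor. If you would rather avoid them, one possible variant (a sketch, not the code used above) is to build exactly one label per row with rep_len before shuffling. The helper name split_random and the toy data frame are assumptions made up for this example.

```r
# Hypothetical helper: split one data frame into k random parts
# whose sizes differ by at most one row.
split_random <- function(df, k) {
  # rep_len() produces exactly nrow(df) labels 1..k, so nothing is recycled
  grp <- sample(rep_len(seq_len(k), nrow(df)))
  split(df, grp)
}

set.seed(0L)
toy <- data.frame(x = 1:1729, y = rnorm(1729))  # toy data with as many rows as FG_df
parts <- split_random(toy, 4)
sapply(parts, nrow)  # group 1 has 433 rows, groups 2 to 4 have 432 each
```

For the whole list this would be lapply(df_list, split_random, k = 4). The group sizes are fixed by rep_len; only the assignment of rows to groups is random.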