rfor-looplapplysimulateiris-dataset

Running analysis on for loop x times


I have the following code that selects 4 rows of iris 1000x, and takes the mean of each 4 row sample:

library(dplyr)

iris<- iris

storage<- list()


counter<- 0
for (i in 1:1000) {
  # sample 3 randomly selected transects 100 time
  tempsample<- iris[sample(1:nrow(iris), 4, replace=F),]

  storage[[i]]=tempsample

  
  counter<- counter+1
  print(counter)
}

# Unpack results into dataframe 
results<- do.call(rbind, storage)
View(results)

results_2<- as.data.frame(results)
results_2<- results_2 %>% mutate(Aggregate = rep(seq(1,ceiling(nrow(results_2)/4)),each = 4))
# View(results_2)


final_results<- aggregate(results_2[,1:4], list(results_2$Aggregate), mean)
# View(final_results)

I want to calculate the bias of each column in relation to their true population parameter. For example using SimDesign's bias():

library(SimDesign)
(bias(final_results[,2:5], parameter=c(5,3,2,1), type='relative'))*100

In this code, the values of parameter are hypothetical true pop. values of each column in the dataframe. I want to do this process 100x to get a distribution of bias estimates for each variable in the dataframe. However, I'm not sure how to fit all of this into a for loop (what I think would be the way to go) so the final output is a dataframe with 100 rows of bias measurements for each iris variable.

Any help with this would be greatly appreciated.

#------------------------------

Update

Trying to run the same code for a stratified sample as opposed to a random sample gives me the following error: *Error in [.data.table(setDT(copy(iris)), as.vector(sapply(1:1000, function(X) stratified(iris, : i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT * I think this might be related to setDT?

This is a result of the following code:

do.call(rbind,lapply(1:100, function(x) {
  bias(
    setDT(copy(iris))[as.vector(sapply(1:1000, function(X) stratified(iris,group="Species", size=1)))][
      , lapply(.SD, mean), by=rep(c(1:1000),4), .SDcols=c(1:4)][,c(2:5)],
    parameter=c(5,3,2,1), 
    type='relative'
  )
}))

I looked into using the following code which was suggested:

get_samples <- function(n, sampsize=4) {
  rbindlist(lapply(1:n, function(x) { 
    splitstackshape::stratified(iris, group="Species",sampsize)[, id:=x]   }))[
      , lapply(.SD, mean), by=.(Species, id)] }

I think I understand what this function is doing (selecting 4 stratified rows of iris, taking the means of each column by species), but I'm not sure how to apply it to the original question of doing it (4*1000)*100 to test the bias (I'm very new at this so apologies if I'm missing something obvious).


Solution

  • Here is one way to do that. I've made some minor changes to your code, and wrapped it in a function. Then, use lapply over a sequence say 1:10 or 1:100, each time running your function, and feeding the result to your bias function from the SimDesign package. Then row bind the resulting list

    library(dplyr)
    
    get_samples <- function(df, size=4, n=1000) {
    
      storage<- list()
      counter<- 0
      
      for (i in 1:1000) {
        tempsample<- df[sample(1:nrow(df), size, replace=F),]
        storage[[i]]=tempsample
        counter<- counter+1
      }
      
      results<- do.call(rbind, storage)
      results_2<- as.data.frame(results)
      results_2<- results_2 %>% mutate(Aggregate = rep(seq(1,ceiling(nrow(results_2)/size)),each = size))
      final_results<- aggregate(results_2[,1:size], list(results_2$Aggregate), mean)
      return(final_results)
    }
    
    
    iris=iris
    
    replicates = lapply(1:10, function(x) {
      result = get_samples(iris)
      (bias(result[,2:5], parameter=c(5,3,2,1), type='relative'))*100
    })
    
    replicates = do.call(rbind, replicates)
    

    Output:

          Sepal.Length Sepal.Width Petal.Length Petal.Width
     [1,]     41.50617    3.292500     86.77408    8.859333
     [2,]     43.26058    2.763500     90.20758   10.825917
     [3,]     43.46642    3.551750     90.11767   10.576250
     [4,]     41.94683    2.970833     86.89625    8.817000
     [5,]     42.08733    3.380917     86.78642    8.996667
     [6,]     42.13050    2.942250     88.02983    9.707500
     [7,]     43.07383    2.775500     89.04583   10.102083
     [8,]     44.10192    2.895167     91.27208   11.188500
     [9,]     41.29408    2.314750     87.59208    9.244333
    [10,]     42.77450    2.781583     90.37342   10.789500
    

    Fast approach to the problem

    library(SimDesign)
    library(data.table)
    
    do.call(rbind,lapply(1:100, function(x) {
      bias(
        setDT(copy(iris))[as.vector(sapply(1:1000, function(X) sample(1:nrow(iris),4)))][
          , lapply(.SD, mean), by=rep(c(1:1000),4), .SDcols=c(1:4)][,c(2:5)],
        parameter=c(5,3,2,1), 
        type='relative'
      )
    }))