rbootstrap-4shufflerandomized-algorithm

How to randomize or shuffle only one-axis and estimate estimated correlation in R?


I am trying to create a null expectation in R. The dataset has four species and associated sample values of x.real and y.real. I want to create a null expectation by only shuffling the x.real for each species. Estimate slope. Repeat 1000 times, say. And then calculate the mean of the slope.

Here is my attempt. I get an error that says Error in filter(., sp == "A") : object '*tmp*' not found Any suggestions on how to correct this to get the desired output (attached).

set.seed(111)
library(truncnorm)
x.real <- rtruncnorm(n = 288,a = 0,b = 10,mean = 5,sd = 2)
y.real <- rnorm(288,0,4)
sp <- rep(c("A","B","C","D"), each = 72)

df <- data.frame(x.real, y.real, sp)

output <- tibble(y.intercept = numeric(),
                 slope = numeric(),
                 sp = character(),
                 set = numeric())

set.seed(42)                    
for(i in 1:1000){
  samp1 <- df %>% filter(sp == 'A') %>% mod1 <- lm(y.real ~ sample(x.real, length(x.real), replace = TRUE)) %>% summarise(y.intercept = mean(mod1$coefficients[1]), slope =  mean(mod1$coefficients[2])) %>% mutate(set = i)
  samp2 <- df %>% filter(sp == 'B') %>% mod2 <-lm(y.real ~ sample(x.real, length(x.real),replace = TRUE)) %>% summarise(y.intercept = mean(mod2$coefficients[1]), slope =  mean(mod2$coefficients[2])) %>% mutate(set = i)
  samp3 <- df %>% filter(sp == 'C') %>% mod3 <-lm(y.real ~ sample(x.real, length(x.real),replace = TRUE)) %>% summarise(y.intercept = mean(mod3$coefficients[1]), slope =  mean(mod3$coefficients[2])) %>% mutate(set = i)
  samp4 <- df %>% filter(sp == 'D') %>% mod4 <-lm(y.real ~ sample(x.real, length(x.real),replace = TRUE)) %>% summarise(y.intercept = mean(mod4$coefficients[1]), slope =  mean(mod4$coefficients[2])) %>% mutate(set = i)
  output %>% add_row(bind_rows(samp1, samp2, samp3, samp4))  -> output
}

 y.intercept slope sp
     2         2    A
     3         1    B
     1        -4    C
     2        -1    D
 

Solution

  • You can do this as follows:

    # set df as data.table
    setDT(df)
    
    # function to estimate model coefficients
    f <- function(x,y) lm(y~sample(x, length(x), replace=T))$coef
    
    # 1000 models for each species
    result = rbindlist(
      lapply(1:1000, \(i) df[,.(est = f(x.real,y.real)), sp][, i:=i])
    )
    
    # add a columns for intercept/slope
    result[, terms:=rep(c("intercept", "slope"), length.out=nrow(result))]
    
    # get mean of intercept/slope by species, and use `dcast` to turn to wide format
    
    dcast(
      result[, .(mean_coef = mean(est)), .(sp,terms)],
      sp~terms, value.var = "mean_coef"
    )
    

    Output:

       sp  intercept        slope
    1:  A -0.6161870  0.001203426
    2:  B  0.2314204  0.001683623
    3:  C -0.7333392  0.008367585
    4:  D  0.1584692 -0.005298080
    

    If you want to follow the tidyverse/for loop approach you were using, here are some edits:

    library(tidyverse)
    output <- tibble(est = numeric(),
                     sp = character(),
                     set = numeric())
    set.seed(42)                    
    for(i in 1:1000){
      sampa = df %>% filter(sp=="A") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="A", set=i)
      sampb = df %>% filter(sp=="B") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="B", set=i)
      sampc = df %>% filter(sp=="C") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="C", set=i)
      sampd = df %>% filter(sp=="D") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="D", set=i)
      output %>% add_row(bind_rows(sampa, sampb, sampc, sampd))  -> output
    }
    output %>% 
      mutate(term = rep(c("intercept", "slope"), length.out=nrow(output))) %>% 
      group_by(sp, term) %>% 
      summarize(mean_coef = mean(est)) %>% 
      pivot_wider(sp, names_from=term, values_from=mean_coef)
    

    Similarly, here, output has 8000 rows (one row for each of 1000 iterations x 2 coefficients x 4 species). I similarly use mutate to add a term column, and then, grouping by species and term, get the mean of the estimate, and pivot wider to get 4 rows:

      sp    intercept     slope
      <chr>     <dbl>     <dbl>
    1 A        -0.603 -0.000827
    2 B         0.272 -0.00587 
    3 C        -0.692  0.000627
    4 D         0.146 -0.00351 
    

    I found the approach with data.table and lapply() to be about 10 times faster.