I am trying to create a null expectation in R. The dataset has four species and associated sample values of x.real
and y.real
. I want to create a null expectation by only shuffling the x.real for each species. Estimate slope. Repeat 1000 times, say. And then calculate the mean of the slope.
Here is my attempt. I get an error that says Error in filter(., sp == "A") : object '*tmp*' not found
Any suggestions on how to correct this to get the desired output (attached).
set.seed(111)
library(truncnorm)
x.real <- rtruncnorm(n = 288,a = 0,b = 10,mean = 5,sd = 2)
y.real <- rnorm(288,0,4)
sp <- rep(c("A","B","C","D"), each = 72)
df <- data.frame(x.real, y.real, sp)
output <- tibble(y.intercept = numeric(),
slope = numeric(),
sp = character(),
set = numeric())
set.seed(42)
for(i in 1:1000){
samp1 <- df %>% filter(sp == 'A') %>% mod1 <- lm(y.real ~ sample(x.real, length(x.real), replace = TRUE)) %>% summarise(y.intercept = mean(mod1$coefficients[1]), slope = mean(mod1$coefficients[2])) %>% mutate(set = i)
samp2 <- df %>% filter(sp == 'B') %>% mod2 <-lm(y.real ~ sample(x.real, length(x.real),replace = TRUE)) %>% summarise(y.intercept = mean(mod2$coefficients[1]), slope = mean(mod2$coefficients[2])) %>% mutate(set = i)
samp3 <- df %>% filter(sp == 'C') %>% mod3 <-lm(y.real ~ sample(x.real, length(x.real),replace = TRUE)) %>% summarise(y.intercept = mean(mod3$coefficients[1]), slope = mean(mod3$coefficients[2])) %>% mutate(set = i)
samp4 <- df %>% filter(sp == 'D') %>% mod4 <-lm(y.real ~ sample(x.real, length(x.real),replace = TRUE)) %>% summarise(y.intercept = mean(mod4$coefficients[1]), slope = mean(mod4$coefficients[2])) %>% mutate(set = i)
output %>% add_row(bind_rows(samp1, samp2, samp3, samp4)) -> output
}
y.intercept slope sp
2 2 A
3 1 B
1 -4 C
2 -1 D
You can do this as follows:
df
as data.table
f
that estimates the model for a given x,y, and returns the coefficientsrbindlist
(this will return a data.table of length 1000x4x2 = 8000 rows)term
) indicating which rows are intercept and which are slopesp
and term
, get the mean of the coefficient# set df as data.table
setDT(df)
# function to estimate model coefficients
f <- function(x,y) lm(y~sample(x, length(x), replace=T))$coef
# 1000 models for each species
result = rbindlist(
lapply(1:1000, \(i) df[,.(est = f(x.real,y.real)), sp][, i:=i])
)
# add a columns for intercept/slope
result[, terms:=rep(c("intercept", "slope"), length.out=nrow(result))]
# get mean of intercept/slope by species, and use `dcast` to turn to wide format
dcast(
result[, .(mean_coef = mean(est)), .(sp,terms)],
sp~terms, value.var = "mean_coef"
)
Output:
sp intercept slope
1: A -0.6161870 0.001203426
2: B 0.2314204 0.001683623
3: C -0.7333392 0.008367585
4: D 0.1584692 -0.005298080
If you want to follow the tidyverse/for loop approach you were using, here are some edits:
library(tidyverse)
output <- tibble(est = numeric(),
sp = character(),
set = numeric())
set.seed(42)
for(i in 1:1000){
sampa = df %>% filter(sp=="A") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="A", set=i)
sampb = df %>% filter(sp=="B") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="B", set=i)
sampc = df %>% filter(sp=="C") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="C", set=i)
sampd = df %>% filter(sp=="D") %>% summarize(est = lm(y.real~sample(x.real, n(), replace=T))$coef) %>% mutate(sp="D", set=i)
output %>% add_row(bind_rows(sampa, sampb, sampc, sampd)) -> output
}
output %>%
mutate(term = rep(c("intercept", "slope"), length.out=nrow(output))) %>%
group_by(sp, term) %>%
summarize(mean_coef = mean(est)) %>%
pivot_wider(sp, names_from=term, values_from=mean_coef)
Similarly, here, output
has 8000 rows (one row for each of 1000 iterations x 2 coefficients x 4 species). I similarly use mutate to add a term
column, and then, grouping by species and term, get the mean of the estimate, and pivot wider to get 4 rows:
sp intercept slope
<chr> <dbl> <dbl>
1 A -0.603 -0.000827
2 B 0.272 -0.00587
3 C -0.692 0.000627
4 D 0.146 -0.00351
I found the approach with data.table
and lapply()
to be about 10 times faster.