rregressionp-valuesample-sizesimulation

Simulating thousands of regressions and obtaining p-values


I'm looking to do some basic simulation in R to examine the nature of p-values. My goal is to see whether large sample sizes trend towards small p-values. My thought is to generate random vectors of 1,000,000 data points, regress them on each other, and then plot the distribution of p-values and look for skew.

This is what I was thinking so far:

x1 = runif(1000000, 0, 1000) 
x2 = runif(1000000, 0, 1000) 
model1 = lm(x2~x1)

Using code taken from another thread:

lmp <- function (modelobject) {
    if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
    f <- summary(modelobject)$fstatistic
    p <- pf(f[1],f[2],f[3],lower.tail=F)
    attributes(p) <- NULL
    return(p)
  }
lmp(model1) 
0.3874139

Any suggestions for how I might do this for 1000 models or even more? Thanks!


Solution

  • see ?replicate ... but the p-value you're calculating assumes gaussian errors not uniform ones (not that this will matter much at n=10^6)

    Specifically, something like this:

    nrep <- 1000
    ndat <- 1000000
    results <- replicate(nrep, {
         x1=runif(ndat, 0, 1000);
         x2=runif(ndat, 0, 1000);
         model1=lm(x1 ~ x2);
         lmp(model1)
         })
    

    should work, but it will take a long time to run.

    I'd suggest making nrep and ndat smaller to try it out.