rdata-generationcausality

Simulate effect of confounder on two random variables


I want to generate some data in order to show partial correlation to control for a confounder.

Specifically, I want to generate data about two uncorrelated random variables (let's say speech and memory) and use a third variable to influence them both (age).

I'd expect to observe a strong correlation between speech and memory, due to the confounder age, and no correlation between the same two variables if I control for age (that is, calculate a partial correlation on age).

That said, I can't generate the strong correlation with my code.


age <- rep(1:10, 10)

speech <- age * abs(rnorm(100))
memory <- age * abs(rnorm(100))

cor(speech, memory) # correlation, it should be high but it's not

residuals_speech <- lm(speech ~ age)$residuals
residuals_memory <- lm(memory ~ age)$residuals

cor(residuals_speech, residuals_memory) # partial correlation controlling for age, it should be around zero


Solution

  • The random noise needs to be additive, not multiplicative:

    set.seed(1)
    
    age <- rep(1:10, 10)
    
    speech <- age + rnorm(100)
    memory <- age + rnorm(100)
    

    Now we have a high correlation between speech and memory:

    cor(speech, memory)
    #> [1] 0.9067772
    

    But the correlation of the residuals once age is taken into account is close to zero, as expected.

    residuals_speech <- lm(speech ~ age)$residuals
    residuals_memory <- lm(memory ~ age)$residuals
    
    cor(residuals_speech, residuals_memory) 
    #> [1] 0.0001755437
    

    To get speech and memory to be on different scales with different amounts of noise (as real data might be), you can multiply age and the random noise by scalar values. This will retain the high correlation caused by the confounder.

    set.seed(1)
    
    age <- rep(1:10, 10)
    
    speech <- 5 * age + 3 * rnorm(100)
    memory <- 2.1 * age - 2 * rnorm(100)
    
    cor(speech, memory)
    #> [1] 0.9361466
    
    residuals_speech <- lm(speech ~ age)$residuals
    residuals_memory <- lm(memory ~ age)$residuals
    
    cor(residuals_speech, residuals_memory) 
    #> [1] -0.0001755437
    

    Created on 2022-11-21 with reprex v2.0.2