rstatisticsdata.tablesimulationmass

simulate negative binomial distribution with offset variable


I am trying to simulate mutation data with known parameters to use it further for testing regression functions. In this simulation I want mutation counts to be dependent on variables:

mutations ~ intercept + beta_cancer + beta_gene + beta_int + offset(log(ntAtRisk)))

where the offset parameter is the maximal number of counts that can theoretically happen.

Creating table with parameters

ncancers <- 20
ngenes <- 20

beta <- CJ(cancer = as.factor(0:ncancers), gene =  as.factor(0:ngenes))
beta[, beta_cancer := rnorm(n = (ncancers+1), sd = 1)[cancer]]
beta[, beta_gene := rnorm(n = (ngenes+1), sd = 1)[gene]]
beta[, beta_int := rnorm(n = (ngenes+1)*(ncancers+1), sd = 1.5)]
beta[, ntAtRisk := abs(round(rnorm(n = (ngenes+1)*(ncancers+1), mean = 5000, sd  = 2000), digits = 0))[gene]]
beta[, intercept := rnorm(n = (ngenes+1)*(ncancers+1), mean = 2, sd = 1)[gene]]

beta[cancer == "0", c("beta_cancer", "beta_int") := 0] # reference cancer type
beta[gene == "0", c("beta_gene", "beta_int") := 0] # reference gene

Simulating mutation counts

beta[, mu := exp(intercept + beta_cancer + beta_gene + beta_int + log(ntAtRisk))]
setkey(beta, cancer, gene)

dat <- beta
setkey(dat, cancer, gene)
dat[, mutations := rnbinom(n = nrow(dat), mu = mu, size = 1.5)]
dat[, mutations2 := MASS::rnegbin(n = nrow(dat), 
                                  mu = exp(intercept + beta_cancer + beta_gene + 
                                           beta_int + offset(log(ntAtRisk))), 
                                  theta = 1.5)]

mutations and mutations2 are made using different functions where offset variable either is included as a normal variable or, in second case, is specified as an offset. However, the test I am doing doesn't pass any of them.

I need mutation counts to be no greater than ntAtRisk but it is not the case, unfortunately. I couldn't find on the internet how I can include the offset into the simulation. What are my options?

ggplot(dat, aes(ntAtRisk, mutations+0.5)) +
  geom_point() +
  xlim(0, max(dat$ntAtRisk)) + 
  ylim(0, max(dat$ntAtRisk)) + 
  geom_abline(color = "red") 

enter image description here


Solution

  • When you fit a glm for poisson, negbin with an offset, the sum of your coefficients and intercepts cannot be greater than 1, because the log(offset) is subtracted from log(response) and this is always < 1, for example:

    n=seq(100,1000,by=100)
    mu = n/5
    y = rnbinom(n = 10,size =1.5,mu=mu)
    glm.nb(y~1+offset(log(n)))
    
    Call:  glm.nb(formula = y ~ 1 + offset(log(n)), init.theta = 1.217692649, 
        link = log)
    
    Coefficients:
    (Intercept)  
         -1.424 
    

    It's a very tricky simulation to set up because of the constraints, in your case, I suggest setting the intercept to be something very low since most likely mutations (if I get it correct), are not so frequent anyway:

    set.seed(222)
    beta <- CJ(cancer = as.factor(0:ncancers), gene =  as.factor(0:ngenes))
    beta[, beta_cancer := rnorm(n = (ncancers+1))[cancer]]
    beta[, beta_gene := rnorm(n = (ngenes+1))[gene]]
    beta[, beta_int := rnorm(n = (ngenes+1)*(ncancers+1))]
    beta[, ntAtRisk := abs(round(rnorm(n = (ngenes+1)*(ncancers+1), mean = 5000, sd  = 2000), digits = 0))[gene]]
    beta[, intercept := runif(n = (ngenes+1)*(ncancers+1),min=-5,max=-3)[gene]]
    beta[cancer == "0", c("beta_cancer", "beta_int") := 0] # reference cancer type
    beta[gene == "0", c("beta_gene", "beta_int") := 0] # reference gene
    

    At this stage, you will accounted for the offset by adding the log term, there's no need to add the offset again later:

    beta[, mu := exp(intercept + beta_cancer + beta_gene + beta_int + log(ntAtRisk))]
    setkey(beta, cancer, gene)
    

    Now we simulate the data, providing the mean as mu and you specify a constant theta value:

    dat <- beta
    setkey(dat, cancer, gene)
    dat[, mutations := rnbinom(n = nrow(dat), mu = mu, size = 1.5)]
    
    ggplot(dat, aes(ntAtRisk, mutations+0.5)) +
      geom_point() +
      xlim(0, max(dat$ntAtRisk)) + 
      ylim(0, max(dat$ntAtRisk)) + 
      geom_abline(color = "red") 
    

    enter image description here

    You can see in this example, some of the counts are > n, because of the dispersion. You either write a code to manually correct this, or I guess you need to really check the data if you do have a prediction so high.