rsimulationmixture-modelmixture

Simulate mixture data with different mix dependecies structure between each two variables?


I would like to simulate a mixture data, say 3 dimensional data. I would like to have 2 different components between each two variables.

That is, simulate mixture data (V1 and V2) where the dependencies between them is two different normal components. Then, between V2 and V3 another two normal components. So, I will have 3d data, the dependence between first and a second variable are a mixture of two normals. And the dependence between the second and third variable are mixture of another two different components.

Another way to explain my question:

Suppose I would like to generate a mixture data as follows:

1- 0.3 normal(0.5,1) + 0.7 normal(2,4) # hence here I will get a bivariate mixture data generated from two different normal (two components of the mixture model), the sum of mixuter weight is 1.

Then, I would like to get another variable as follows:

2- 0.5 normal(2,4) # this is the second variable on the first simulate + 0.5 normal(2,6)

so here, I get 3d simulated mixture data, where V1 and V2 are generated by two different mixture components, and the V2 and V3 are generated by another different mixture components.

This is how to generate the data in r: ( I belive it is not generate a bivariate data)

N <- 100000                 

#Sample N random uniforms U
U <- runif(N)

#Variable to store the samples from the mixture distribution                                             
rand.samples <- rep(NA,N)

#Sampling from the mixture
for(i in 1:N) {
    if(U[i]<.3) {
        rand.samples[i] <- rnorm(1,1,3)
    } else {
        rand.samples[i] <- rnorm(1,2,5)
    }
}

so if we generate mixture bivariate data (two variables) then how can extend this to have 4 or 5 variables, where V1 and V2 are generated from two different normals (the dependencies structures between them is a mixture of two normals) and then V3 will generated from another another differetn normal and then compine with V2. That is when we plot the V2 ~ V3 we will find that the dependencies structures between them is a mixutre of two normals and so on.


Solution

  • I am not really sure I have correctly understood the question but I will give it a try. You have 3 distributions D1, D2 and D3. From these three distributions you would like to create variables that use 2 out of those 3 but not the same ones.

    Since I do not know how the distributions should be combined I used the flags using the binomial distribution (its a vector of length equal to 200 with 0s and 1s) to determine from which distribution each value will be picked (You can change that if that is not how you want it done).

    D1 = rnorm(200,2,1)
    D2 = rnorm(200,3,1)
    D3= rnorm(200,1.5,2)
    

    In order to created the mixed distribution we can use the rbinom function to create a vector of 1s and 0s according to a selected probability. This is a way to have some values from both distributions.

    var_1_flag <- rbinom(200, size=1, prob = 0.3)
    var_1 <- var_1_flag*D1 + (1 - var_1_flag)*D2
    
    var_2_flag <- rbinom(200, size=1, prob = 0.7)
    var_2 <- var_2_flag*D2 + (1 - var_2_flag)*D3
    
    var_3_flag <- rbinom(200, size=1, prob = 0.6)
    var_3 <- var_3_flag*D1 + (1 - var_3_flag)*D3
    

    In order to see which values come from which distribution you can do the following:

    var_1[var_1_flag] #This gives you the values in the mixed distribution that come from the first distribution (D1)

    var1[!var_1_flag] #This gives you the values in the mixed distribution that come from the second distribution (D2)

    Since I found this a bit manual and I am guessing you might want to change the variables, you might want to use the function below to get the same results

    create_distr <- function(observations, mean1, sd1, mean2, sd2, flag_prob) {
    
        flag <- rbinom(observations, size=1, prob = flag_prob)
        my_distribution <- flag * rnorm(observations, mean1, sd1) + (1 - flag) * rnorm(observations, mean2, sd2)
    }
    
    var_1 <- create_distr(200, 2, 1, 3, 1, 0.5)
    var_2 <- create_distr(200, 3, 1, 1.5, 2, 0.7)
    var_3 <- create_distr(200, 2, 1, 1.5, 2, 0.6)
    

    If you would like to have more than two variables (distributions) to the mix you could extend the code you have provided as follows:

    N <- 100000                 
    
    #Sample N random uniforms U
    U <- runif(N)
    
    #Variable to store the samples from the mixture distribution                                             
    rand.samples <- rep(NA,N)
    
    for(i in 1:N) {
      if(U[i] < 0.3) {
        rand.samples[i] <- rnorm(1,1,3)
      } else if (U[i] < 0.5){
        rand.samples[i] <- rnorm(1,2,5)
      } else if (U[i] < 0.8) {
        rand.samples[i] <- rnorm(1,5,2)
      } else {
        rand.samples[i] <- rt(1, 2)
      }
    }
    

    This way every element is taken one at a time from each distribution. If you want to have the same result but without taking each element one at a time you can do the following:

    N <- 100000                 
    
    #Sample N random uniforms U
    U <- runif(N)
    
    #Variable to store the samples from the mixture distribution                                             
    rand.samples <- rep(NA,N)
    
    D1 = rnorm(N,1,3)
    D2 = rnorm(N,2,5)
    D3= rnorm(N,5,2)
    D4 = rt(N, 2)
    
    rand.samples <- c(D1[U < 0.3], D2[U >= 0.3 & U < 0.5], D3[U >= 0.5 & U < 0.8], D4[U >= 0.8])
    

    Which corresponds to 0.3*normal(1,3) + 0.2*normal(2,5) + 0.3*normal(5,2) + 0.2*student(2 degrees of freedom)

    If you want to create two mixtures, but in the second keep the same values from the normal distribution you can do the following:

    mixture_1 <- c(D1[U < 0.3], D2[U >= 0.3 ])
    mixture_2 <- c(D1[U < 0.3], D3[U >= 0.3])
    

    This will use the exact same elements from normal(1,3) in both mixtures. The trick is to not recalculate the rnorm(N,1,3) every time you use it. And in both cases the samples are composed from 30% roughly coming from the first normal (D1) and 70% roughly from the second distribution. For example:

        set.seed(1)
        N <- 100000   
        U <- runif(N)
        > prop.table(table(U < 0.3))
    
     FALSE   TRUE 
    0.6985 0.3015 
    

    30% of the values in the U vector is below 0.3.