rrandomdirichlet

Sampling of positive weights with sum unity and an equality constraint


Suppose I have a vector of positive weights a=(a1, a2, a3, a4) such that a2=a3 and a1+a2+a3+a4=1. Is there any way to sample this kind of weights using R? I tried to think about using the Dirichlet distribution, but it gives no mechanism to force two of the variates to be equal.


Solution

  • To sample evenly across the set {(a1, a2, a3, a4 | a2=a3, a1+a2+a3+a4=1, a1>0, a2>0, a3>0, a4>0}, I would first sample the value for a2 (which is equal to a3). To do this we need to know the distribution of this value. If a2 = a3 = r, then we have a1+a4 = 1-2r; for positive a1 and a4 there is a line segment of length (1-2k)*sqrt(2) containing all feasible values of a1 and a4. Integrating, the probability that a2 is k or less is 4(k - k^2). In more detail:

    Prob (a2 <= k) = Integral(0 to k) (1-2r)*sqrt(2) dr / Integral(0 to 0.5) (1-2r)*sqrt(2) dr
                   = ((k-k^2)*sqrt(2)) / (sqrt(2)/4)
                   = 4k - 4k^2
    

    Thus, we can sample values for a2 by selecting a uniformly distributed value u~U(0, 1) and setting a2 to equal the value k for which 4k - 4k^2 = u. Solving via the quadratic formula, this yields:

    a2 = 0.5 * (1 - sqrt(1-u))
    

    In R, we can sample 1000 values for a2 with:

    set.seed(144)
    a2 <- 0.5 * (1 - sqrt(1 - runif(1000)))
    a3 <- a2
    

    Given a fixed value a2 = a3 = k, the value of a1 is uniformly distributed in [0, 1-2k]:

    a1 <- runif(1000) * (1 - 2*a2)
    

    Having specified a1, a2, and a3, there is only one possible value for a4:

    a4 <- 1 - a1 - a2 - a3
    

    We can take a look at some of our sampled values:

    head(cbind(a1, a2, a2, a4))
    #              a1         a2         a2         a4
    # [1,] 0.83455239 0.01251016 0.01251016 0.14042729
    # [2,] 0.02744599 0.22932773 0.22932773 0.51389856
    # [3,] 0.45835472 0.23860119 0.23860119 0.06444291
    # [4,] 0.36843649 0.14679703 0.14679703 0.33796946
    # [5,] 0.35109881 0.08702039 0.08702039 0.47486041
    # [6,] 0.02916818 0.19942616 0.19942616 0.57197949
    

    Here is the distribution of a1 values (note that by symmetry this is identical to the distribution of the a4 values). Because we select a1 uniformly in the range [0, 1-2*a2], lower values are more common than higher values:

    enter image description here

    Here is the distribution of a2 values (by definition this is the same as the distribution of the a3 values). The shape of the distribution is similar to the one of a1, but the maximum value is 0.5:

    enter image description here