Tags: python, scipy, montecarlo

Generating multiple non-identically distributed quasi-random variables using SciPy's QMC (Sobol)


I have a problem using SciPy’s Quasi-Monte Carlo features for function estimation. I want to build survival functions for functions of several variables, so I'm using something like this:

Y = c1·f1(X1) + c2·f2(X2) + c3·f3(X3) + …

Where:

• c1, c2, c3 are constants

• f1, f2, f3… are functions (mostly linear, but not always)

• X1, X2, X3... are independent but not identically distributed random variables. In fact, the distributions are mostly different.

For “normal” Monte Carlo, the problem is solved: I just generate random numbers from the different distributions and put them into the equation above. But for Quasi-Monte Carlo, I’m stuck.
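For context, the plain-MC version might look like the following minimal sketch; the constants, functions, and distributions here are illustrative, not the actual model:

```python
import numpy as np
from scipy import stats

# Plain Monte Carlo: draw each variable from its own distribution
# (illustrative choices), then combine via Y = c1*f1(X1) + c2*f2(X2) + c3*f3(X3).
rng = np.random.default_rng(0)
size = 10_000
x1 = stats.fisk(3.9).rvs(size=size, random_state=rng)
x2 = stats.norm.rvs(size=size, random_state=rng)
x3 = stats.expon.rvs(size=size, random_state=rng)
y = 2.0 * x1 + 0.5 * x2**2 + 1.5 * x3
```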

If the random variables were all the same distribution, I could do something like this (x_rvs being my random variables and there being no_dims of them):

import scipy.stats
import scipy.stats.qmc
import scipy.stats.sampling

engine = scipy.stats.qmc.Sobol(d=no_dims, scramble=True)
dist = scipy.stats.fisk(3.9)
base = scipy.stats.sampling.NumericalInverseHermite(dist)
x_rvs = base.qrvs(size=size, qmc_engine=engine)

I could then use my x_rvs. Problem is, my distributions are all different.

One option is to generate uniform variables over [0, 1] and map them to the other distributions. However, given that the QMC documentation makes a big deal of supporting different distributions, I can’t help feeling there’s a way to generate the random variables I need directly. I know I can’t just take one 1D QMC Sobol engine and use it repeatedly to generate the different distributions (the sequences I get that way are too correlated to be of any use), so I know I have to use multiple dimensions in my engine.

Anyone got any ideas for how to generate multiple quasi-random variables with different distributions that keep the nice properties of QMC?


Solution

  • However, given that the QMC documentation makes a big deal of supporting different distributions...

    At this time, scipy.stats.qmc does not have features for directly generating QMC sequences from any distributions other than the uniform, multinomial, and multivariate normal distributions.

    You already found the qrvs method of classes in scipy.stats.sampling; that's the only other thing SciPy has in existing releases.

    SciPy 1.15 will have a new random variable infrastructure. You will be able to use make_distribution to convert (almost) any existing SciPy continuous distribution to the new interface, whose sample method will accept either a NumPy Generator or a SciPy QMCEngine as the rng argument. When it receives a QMCEngine, it performs inverse transform sampling to generate QMC samples from other distributions. (I believe this is what you were referring to with "One option is to use [uniform] distributions over the range [0,1] and map them to other distributions".) SciPy 1.15 RC1 is available now, and the relevant documentation is here.
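    As a hedged sketch of that 1.15 interface (check the make_distribution documentation in your release for the exact signature):

    ```python
    import numpy as np
    from scipy import stats

    # Convert the old-style fisk distribution to the new interface;
    # shape parameters are passed by keyword when instantiating.
    Fisk = stats.make_distribution(stats.fisk)
    X = Fisk(c=3.9)

    # Passing a QMCEngine as `rng` triggers inverse transform sampling
    # from the low-discrepancy sequence.
    engine = stats.qmc.Sobol(d=1, scramble=True, seed=42)
    sample = X.sample(2**10, rng=engine)
    ```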

    I know I can’t just generate quasi-random data by using the QMC Sobol engine and using it repeatedly to generate different distributions (the sequences I get are too correlated if I do this and of no use)

    Yes, good not to fall into that trap!

    so I know I have to use multiple dimensions in my engine.

    That is one option - you could create a multivariate uniform with one dimension per term in your sum, then use the coordinates of each dimension as an "independent" 1D, low-discrepancy sample.

    import numpy as np
    from scipy import stats
    
    size = 2**10
    rng = np.random.default_rng(235855178283348532492417763208635684641)
    engine = stats.qmc.Sobol(d=2, scramble=True, seed=rng)
    x_uniform = engine.random(size)  # low-discrepancy points in [0, 1)^2
    x_rvs1 = stats.fisk(3.9).ppf(x_uniform[:, 0])  # inverse transform, per column
    x_rvs2 = stats.norm.ppf(x_uniform[:, 1])
    x_rvs = x_rvs1 + x_rvs2
    ecdf = stats.ecdf(x_rvs)  # empirical distribution of the sum
    ecdf.sf.plot()  # plot the survival function (requires matplotlib)
    
    

    Another option (which may not be as low-discrepancy overall) is to create each QMCEngine from its own Generator, spawned with distinct seeds, and with scramble=True. SciPy has a private _rng_spawn function for this purpose. It is used in qmc_quad to obtain (for most practical purposes) independent estimates of the integral, from which it estimates the error. There was a discussion about this approach here.

    So in SciPy 1.14 that might look like:

    import numpy as np
    from scipy import stats
    
    def _rng_spawn(rng, n_children):
        # https://github.com/scipy/scipy/blob/d916ddbab38468312fd0bee53ff68d5276f4f84f/scipy/_lib/_util.py#L1025
        # spawns independent RNGs from a parent RNG
        bg = rng._bit_generator
        ss = bg._seed_seq
        child_rngs = [np.random.Generator(type(bg)(child_ss))
                      for child_ss in ss.spawn(n_children)]
        return child_rngs
    
    size = 2**10
    rng = np.random.default_rng(235855178283348532492417763208635684641)  # high entropy seed generated with np.random.SeedSequence
    rngs = _rng_spawn(rng, 2)
    
    engine = stats.qmc.Sobol(d=1, scramble=True, seed=rngs[0])
    dist = stats.fisk(3.9)
    base = stats.sampling.NumericalInverseHermite(dist)
    # Permute the order to avoid artifacts (use the seeded generator for reproducibility)
    x_rvs1 = rng.permutation(base.qrvs(size=size, qmc_engine=engine))
    
    engine = stats.qmc.Sobol(d=1, scramble=True, seed=rngs[1])
    dist = stats.norm()
    base = stats.sampling.NumericalInverseHermite(dist)
    x_rvs2 = rng.permutation(base.qrvs(size=size, qmc_engine=engine))
    
    # Regular MC Samples
    # x_rvs1 = stats.fisk(3.9).rvs(size=size)
    # x_rvs2 = stats.norm.rvs(size=size)
    x_rvs = x_rvs1 + x_rvs2
    ecdf = stats.ecdf(x_rvs)  # empirical distribution of the sum
    ecdf.sf.plot()  # plot the survival function (requires matplotlib)
    

    For generating an empirical survival function, you might find scipy.stats.ecdf useful (last two lines).

    Both of these look a lot nicer than what you'd get with regular MC for a given number of points. For example, with 2**6 points:

    (Plot comparing the QMC and plain-MC empirical survival functions omitted.)