I need to randomly sample from a fairly complex probability density function (PDF) with a known cumulative distribution function (CDF) and I'm trying to use inverse transform sampling. That should be easy to do, since I have the CDF and just need to numerically invert it (not possible to do algebraically) while plugging in uniform random numbers. However, the resulting distribution has a lower variance than expected and I can not find any mistake in the CDF.
So I simplified and tested my algorithm by sampling from a normal distribution. The result was the same: location is ok, but the scale is wrong. I'm aware that there are better and built-in methods for Gaussian sampling, but this is just a test of the sampling algorithm.
The problem originally arose in Fortran, but I have since replicated the problem in Python, so I have to do something fundamentally wrong or have numerical problems.
import numpy as np
from scipy.special import erf
from scipy.optimize import brentq
import matplotlib.pyplot as plt
from scipy.stats import norm
def testfunc(x):
## Test case, result should be 6.04880103
# out = 0.5 * (1. + erf((x - 5.) / (2. * np.sqrt(2.)))) - 0.7
r = np.random.uniform()
# hand-built cdf:
# out = 0.5 * (1. + erf((x - 5.) / (2. * np.sqrt(2.)))) - r
# scipy cdf:
out = norm.cdf(x, 5, 2) - r
return out
if __name__ == '__main__':
n = 10000
sol_array = np.zeros(n)
for i in range(0, n):
sol_array[i] = brentq(testfunc, -100.,100.)
print('mean = ' + str(np.mean(sol_array)))
print('std = ' + str(np.std(sol_array)))
plt.hist(sol_array, normed=True, bins='fd')
x = np.linspace(-1, 11, 1000)
plt.plot(x, norm.pdf(x, 5, 2))
plt.show()
The mean of the sampled values is about 5, as expected, but the standard deviation is about 1.28, where it should be 2, for both my hand-built CDF and the CDF of scipy
.
This is also visible in the histogram:
The same problem in Fortran, although with a different value for the resulting standard deviation. The code is longer because the solver is included. That solver is a translated Fortran 90 version by Alan Miller of an old FORTRAN 77 netlib-routine (zeroin.f).
implicit none
integer, parameter :: dp = selected_real_kind(15, 307)
integer, parameter :: n = 1000000
real, dimension(n) :: v
real :: mean, std
integer, dimension(:), allocatable :: seed
integer :: i, seedsize, clock
! seed the PRNG
call random_seed(size=seedsize)
allocate(seed(seedsize))
call system_clock(count=clock)
seed=clock + 37 * (/ (i - 1, i=1, seedsize) /)
call random_seed(put=seed)
deallocate(seed)
do i = 1, n
v(i) = real(zeroin(testfunc, -100._dp, 100._dp, 1e-20_dp, 1e-10_dp))
end do
mean = sum(v) / n
std = sum((v - mean)**2) / n
print*, mean, std
contains
function testfunc(v)
implicit none
real(dp), intent(in) :: v
real(dp) :: testfunc, r
call random_number(r)
! testfunc = 0.5 * (1. + erf((v-5.)/(2.*sqrt(2.)))) - 0.7 ! should be 6.04880
testfunc = 0.5 * (1. + erf((v-5.)/(2.*sqrt(2.)))) - r ! Gaussian test with mu=5 and sigma=2
end function testfunc
function zeroin(f, ax, bx, aerr, rerr) result(fn_val)
! original zeroin.f from netlib.org
! code converted using to_f90 by alan miller
! date: 2003-07-14 time: 12:32:54
!-----------------------------------------------------------------------
! finding a zero of the function f(x) in the interval (ax,bx)
! ------------------------
! INPUT:
! f function subprogram which evaluates f(x) for any x in the
! closed interval (ax,bx). it is assumed that f is continuous,
! and that f(ax) and f(bx) have different signs.
! ax left endpoint of the interval
! bx right endpoint of the interval
! aerr the absolute error tolerance to be satisfied
! rerr the relative error tolerance to be satisfied
! OUTPUT:
! abcissa approximating a zero of f in the interval (ax,bx)
!-----------------------------------------------------------------------
! zeroin is a slightly modified translation of the algol procedure
! zero given by Richard Brent in "Algorithms for Minimization without
! Derivatives", Prentice-Hall, Inc. (1973).
implicit none
real(dp), intent(in) :: ax
real(dp), intent(in) :: bx
real(dp), intent(in) :: aerr
real(dp), intent(in) :: rerr
real(dp) :: fn_val
real(dp) :: a, b, c, d, e, eps, fa, fb, fc, tol, xm, p, q, r, s, atol, rtol
interface
real(selected_real_kind(15, 307)) function f(x)
real(selected_real_kind(15, 307)), intent(in) :: x
end function f
end interface
! compute eps, the relative machine precision
eps = epsilon(0.0_dp)
! initialization
a = ax
b = bx
fa = f(a)
fb = f(b)
if (fb*fa > 0.) then
print*, 'a, b, fa, fb', a, b, fa, fb
stop
end if
atol = 0.5 * aerr
rtol = max(0.5_dp*rerr, 2.0_dp*eps)
! begin step
10 c = a
fc = fa
d = b - a
e = d
20 if (abs(fc) < abs(fb)) then
a = b
b = c
c = a
fa = fb
fb = fc
fc = fa
end if
! convergence test
tol = rtol * max(abs(b),abs(c)) + atol
xm = 0.5 * (c-b)
if (abs(xm) > tol) then
if (fb /= 0.0) then
! is bisection necessary
if (abs(e) >= tol) then
if (abs(fa) > abs(fb)) then
! is quadratic interpolation possible
if (a == c) then
! linear interpolation
s = fb / fc
p = (c-b) * s
q = 1.0 - s
else
! inverse quadratic interpolation
q = fa / fc
r = fb / fc
s = fb / fa
p = s * ((c-b)*q*(q-r)-(b-a)*(r-1.0))
q = (q-1.0) * (r-1.0) * (s-1.0)
end if
! adjust signs
if (p > 0.0) q = -q
p = abs(p)
! is interpolation acceptable
if (2.0*p < (3.0*xm*q-abs(tol*q))) then
if (p < abs(0.5*e*q)) then
e = d
d = p / q
go to 30
end if
end if
end if
end if
! bisection
d = xm
e = d
! complete step
30 a = b
fa = fb
if (abs(d) > tol) b = b + d
if (abs(d) <= tol) b = b + sign(tol,xm)
fb = f(b)
if (fb*(fc/abs(fc)) > 0.0) go to 10
go to 20
end if
end if
! done
fn_val = b
end function zeroin
end
The mean of the resulting samples is about 5, while the standard deviation is about 1.64.
Does anyone have an idea where my algorithm could become numerically problematic? The fact that the Python version and the Fortran version both have the same problem, but to different degrees makes me believe that it's some rounding of floating point numbers problem, but I cannot image where. Even if the solver returns a rounded value, that difference should not show up in a simple histogram.
Does anyone see a mistake in my algorithms? Am I understanding something wrong?
I only checked the Python version and there's indeed an error in the implementation.
Namely, your testfunc
, ie the target function of root-finding brentq
routine, behaves non-deterministically. During a root-finding run (i.e. one call to brentq()
until it returns), brentq
needs to call the same callback multiple times until convergence is reached. However, each time brentq
calls this callback, the target equation changes as r
gets a new pseudo-random value. As a result, the root-finding routine cannot converge to your desired solution.
What you need to do instead, conceptually, is to first generate a sample of uniform random variables, and apply the same, deterministic transformation (i.e. the inverse distribution function) to them. Of course you don't need to do root-solving, as you can use the ppf
(percentile function, i.e. inverse of cdf
) method of scipy.stats
random variable classes.
As a proof of concept, you can run the following code with the (unnecessarily expensive and not very accurate) transformation method on an array of standard uniform sample:
import numpy
import numpy.random
from scipy.optimize import brentq
from scipy.stats import norm
# Setup
n = 10000
numpy.random.seed(0x5eed)
ran_array = numpy.random.uniform(size=n)
sol_array = numpy.empty_like(ran_array)
# Target function for root-finding: F(x) - p = 0 where p is the probability level
# for which the quantile is to be found
def targetfunc(x, p):
return norm.cdf(x, 5, 2) - p
for i in range(n):
sol_array[i] = brentq(targetfunc, -100.0, 100.0, args=(ran_array[i],))
print("mean = %10f" % sol_array.mean())
print("std = %10f" % sol_array.std())
Output:
mean = 5.011041
std = 2.009365