I've got a problem at work that goes as follows:
We have, say, 1 million possible events which define frequency-severity distributions. For each event we have an annual rate which defines a Poisson distribution, and alpha and beta parameters for a Beta distribution. The goal is to simulate on the order of >100,000 "years", where each year is defined by drawing a frequency N for each event and then drawing N samples from the corresponding Beta distribution.
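For a single event in a single year this boils down to something like the following sketch (the parameter values here are purely illustrative):
import numpy as np

rng = np.random.default_rng()

lam, a, b, exposure = 1e-5, 0.4, 100.0, 4.2e8    # illustrative values only
n = rng.poisson(lam)                              # frequency for this event in this year
losses = rng.beta(a, b, size=n) * exposure        # n severity samples, scaled by the event's value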
The cumbersome part for me is: how can I efficiently draw N_i ~ Poisson(lambda_i) samples from the Beta distribution Beta_i, while also making sure I can attribute them to the correct year?
In terms of output I'll need to look at both the maximum and the total value of the samples per year, so for now I'm just storing them as a list of dictionaries (not intended to be the final output format):
import numpy as np

# `data` is a DataFrame with columns 'RATE', 'alpha', 'beta' and 'ExpValue'
years = 5000
rng = np.random.default_rng()

losses = []
for year in range(years):
    occurences = rng.poisson(data['RATE'])
    annual_losses = []
    for idx, occs in enumerate(occurences):
        if occs > 0:
            event = data.iloc[idx]
            for occ in range(occs):
                loss = rng.beta(event['alpha'], event['beta']) * event['ExpValue']
                annual_losses.append(loss)
    annual_losses.append(0)  # ensure the list is never empty
    losses.append({'year': year, 'losses': annual_losses})
I've tried to follow Optimising Python/Numpy code used for Simulation, but I can't seem to work out how to vectorise this code efficiently.
Changes I've made before posting here are (times for 5000 years):
Ideally I'd like this to run as fast, or for as many iterations, as possible, and I was also running into issues with memory using scipy previously.
EDIT:
Upon request, here's a version that just calculates the maximum loss for each year:
years = 20_000
rng = np.random.default_rng()

largest_losses = np.zeros(shape=years)
for year in range(years):
    occurences = rng.poisson(data['RATE'])
    largest_loss = 0
    for idx, event_occs in enumerate(occurences):
        if event_occs > 0:
            event = data.iloc[idx]
            for event_occ in range(event_occs):
                loss = rng.beta(event['alpha'], event['beta']) * event['ExpValue']
                if loss > largest_loss:
                    largest_loss = loss
    largest_losses[year] = largest_loss
For testing purposes the events have a total rate of ~0.997 and the timed tests above are from an event pool of 100,604 events.
To specify the goal, I'd like to see if I can calculate the percentiles of the losses, e.g. the 1-in-250 loss
np.percentile(largest_losses, (1 - 1/250) * 100)
which currently isn't accurate with only 5,000 years, hence wanting a process that can run in a few minutes or less for ~100,000 years.
EDIT 2: Reproducible values (as np arrays):
nevents = 100_604  # size of the event pool used in the timings above

rate_scale = 4.81e-4
rate_shape = 2.06e-2
rates = rng.gamma(rate_shape, rate_scale, size=nevents)

alpha_scale = 1.25
alpha_shape = 0.439
alphas = rng.gamma(alpha_shape, alpha_scale, size=nevents)

beta_scale = 289
beta_shape = 0.346
betas = rng.gamma(beta_shape, beta_scale, size=nevents)

value_scale = 4.2e8
value_shape = 1.0
values = rng.gamma(value_shape, value_scale, size=nevents)
and I've amended my current code to use numpy arrays:
years = 20_000
rng = np.random.default_rng()

largest_losses = np.zeros(shape=years)
for year in range(years):
    occurences = rng.poisson(rates)
    largest_loss = 0
    for idx, event_occs in enumerate(occurences):
        for event_occ in range(event_occs):
            loss = rng.beta(alphas[idx], betas[idx]) * values[idx]
            if loss > largest_loss:
                largest_loss = loss
    largest_losses[year] = largest_loss

f'{np.percentile(largest_losses, (1 - 1/250) * 100):,.0f}'
I profiled @WarrenWeckesser's answer, and I found that it spends 93% of its time generating poisson-distributed numbers using the rate array.
Here's a line profile of what I'm talking about:
Timer unit: 1e-09 s
Total time: 3.57874 s
File: /tmp/ipykernel_2774/1125349843.py
Function: simulate_warren_answer at line 1
Line #      Hits          Time   Per Hit   % Time  Line Contents
================================================================
     1                                             def simulate_warren_answer(nyears, rate, alpha, beta, value, rng):
     2         1       13092.0   13092.0      0.0      max_loss = np.zeros(nyears, dtype=np.float64)
     3         1        3150.0    3150.0      0.0      total_loss = np.zeros(nyears, dtype=np.float64)
     4      1001      347152.0     346.8      0.0      for year in range(nyears):
     5      1000  3355879564.0     3e+06     93.8          n = rng.poisson(rate)
     6      1000   179118381.0  179118.4      5.0          k = np.nonzero(n)[0]
     7      1000      911231.0     911.2      0.0          if len(k) > 0:
     8       654      935876.0    1431.0      0.0              nonzero_n = n[k]
     9       654     4049698.0    6192.2      0.1              a = np.repeat(alpha[k], nonzero_n)
    10       654     2645836.0    4045.6      0.1              b = np.repeat(beta[k], nonzero_n)
    11       654     2550944.0    3900.5      0.1              v = np.repeat(value[k], nonzero_n)
    12       654    26183513.0   40036.0      0.7              losses = rng.beta(a, b)*v
    13       654     3557136.0    5439.0      0.1              max_loss[year] = losses.max()
    14       654     2543208.0    3888.7      0.1              total_loss[year] = losses.sum()
    15         1         303.0     303.0      0.0      return max_loss, total_loss
For this reason, I started looking into whether it was possible to speed up poisson-distributed random number generation. I started by looking at the lengths of n and k, and realized that for the rates the asker is interested in, k is much, much smaller than n, and most entries in n are zero.
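As a quick illustration (this check is mine, not part of the timed code), using the rates array from the question:
n = rng.poisson(rates)
print((n == 0).mean())                  # overwhelmingly close to 1 for these small rates
print(np.nonzero(n)[0].size, n.size)    # k is tiny compared to n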
If there is a 98% chance that the poisson-distributed number will be zero, I can simulate this by generating a uniformly distributed random number on [0, 1) and comparing it to 0.02. This leaves me to handle the other case, where the number is 1 or more, only 2% of the time.
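For a single event with a small rate, that zero/non-zero split can be sketched like this (the rate value here is purely illustrative):
lam = 0.0202                        # illustrative rate: P(N == 0) = exp(-lam) ~ 0.98
u = rng.random(1_000_000)
nonzero = u < (1 - np.exp(-lam))    # True about 2% of the time, i.e. N >= 1
print(nonzero.mean())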
I then realized that I could generalize this to generating any value of the Poisson distribution by comparing the uniform random value to the CDF. (In the code below, I am using the survival function, which is 1 - CDF. This is for accuracy reasons; numbers near 0 are represented more accurately than numbers near 1.)
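Written out for a single rate, the inversion looks like the scalar sketch below; the vectorised lookup-table version follows further down.
import scipy.stats

lam = 0.0202
u = rng.random()
k = 0
while u < scipy.stats.poisson.sf(k, lam):   # while u < P(N > k), the sample must exceed k
    k += 1
# k is now a Poisson(lam)-distributed draw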
This code generates a lookup table for converting uniformly-distributed numbers into poisson-distributed numbers according to each rate. Because the rates are small, most of the time the code will only need to examine the first entry for each rate, and the result will be zero. It also does the k = np.nonzero(n)[0] step inside this loop, as it is more cache efficient.
Because the rates are not changing, the lookup table only needs to be built once and can be re-used between years.
import itertools  # needed by build_lookup below

import scipy.stats
import numpy as np
import numba as nb

def simulate_new_rng(nyears, rate, alpha, beta, value, rng):
    max_loss = np.empty(nyears, dtype=np.float64)
    total_loss = np.zeros(nyears, dtype=np.float64)
    lookup = build_lookup(rate)
    for year in range(nyears):
        uniform = rng.random(len(rate))
        n, nonzeros = uniform_to_poisson_plus_nonzero(uniform, lookup)
        n_nonzeros = n[nonzeros]
        a = np.repeat(alpha[nonzeros], n_nonzeros)
        b = np.repeat(beta[nonzeros], n_nonzeros)
        v = np.repeat(value[nonzeros], n_nonzeros)
        losses = rng.beta(a, b)*v
        max_loss[year] = losses.max(initial=0)
        total_loss[year] = losses.sum()
    return max_loss, total_loss
def build_lookup(rate, ptol=1e-15):
    """Build uniform -> poisson lookup table based on survival function.

    Does not include an entry for k if the probability of the most likely
    rate to be k is less than ptol."""
    ret = []
    for k in itertools.count():
        sf = scipy.stats.poisson.sf(k, rate)
        # include first row of lookup table no matter what
        if sf.max() < ptol and k != 0:
            break
        ret.append(sf)
    return np.array(ret)
@nb.njit()
def uniform_to_poisson_plus_nonzero(numbers, lookup):
    assert numbers.shape[0] == lookup.shape[1]
    nonzeros = []
    ret = np.empty(lookup.shape[1], dtype=np.int64)
    for j in range(lookup.shape[1]):
        number = numbers[j]
        for i in range(lookup.shape[0]):
            if number < lookup[i, j]:
                continue
            else:
                break
        else:
            raise Exception("failed to find match for survival function - ptol is too big")
        ret[j] = i
        if i != 0:
            nonzeros.append(j)
    return ret, np.array(nonzeros, dtype=np.int64)
Benchmark results:
Original code: 24.4 s ± 308 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Warren's answer: 3.44 s ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My answer: 648 ms ± 17.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
import matplotlib.pyplot as plt
import itertools
import scipy.stats
import numba as nb
import numpy as np
import pandas as pd
rng = np.random.default_rng()
nevents = 100604
nyears = 1000
rate_scale = 4.81e-4
rate_shape = 2.06e-2
rates = rng.gamma(rate_shape, rate_scale, size=nevents)
alpha_scale = 1.25
alpha_shape = 0.439
alphas = rng.gamma(alpha_shape, alpha_scale, size=nevents)
beta_scale = 289
beta_shape = 0.346
betas = rng.gamma(beta_shape, beta_scale, size=nevents)
value_scale = 4.2e8
value_shape = 1.0
values = rng.gamma(value_shape, value_scale, size=nevents)
def simulate_orig_answer(nyears, rate, alpha, beta, value, rng):
    data = pd.DataFrame({'RATE': rate, 'alpha': alpha, 'beta': beta, 'ExpValue': value})
    rng = np.random.default_rng()
    largest_losses = np.zeros(shape=nyears)
    total_loss = np.zeros(nyears, dtype=np.float64)
    for year in range(nyears):
        occurences = rng.poisson(data['RATE'])
        largest_loss = 0
        sum_loss = 0
        for idx, event_occs in enumerate(occurences):
            if event_occs > 0:
                event = data.iloc[idx]
                for event_occ in range(event_occs):
                    loss = rng.beta(event['alpha'], event['beta']) * event['ExpValue']
                    if loss > largest_loss:
                        largest_loss = loss
                    sum_loss += loss
        largest_losses[year] = largest_loss
        total_loss[year] = sum_loss
    return largest_losses, total_loss
def simulate_warren_answer(nyears, rate, alpha, beta, value, rng):
    max_loss = np.zeros(nyears, dtype=np.float64)
    total_loss = np.zeros(nyears, dtype=np.float64)
    for year in range(nyears):
        n = rng.poisson(rate)
        k = np.nonzero(n)[0]
        if len(k) > 0:
            nonzero_n = n[k]
            a = np.repeat(alpha[k], nonzero_n)
            b = np.repeat(beta[k], nonzero_n)
            v = np.repeat(value[k], nonzero_n)
            losses = rng.beta(a, b)*v
            max_loss[year] = losses.max()
            total_loss[year] = losses.sum()
    return max_loss, total_loss
def simulate_new_rng(nyears, rate, alpha, beta, value, rng):
    max_loss = np.empty(nyears, dtype=np.float64)
    total_loss = np.zeros(nyears, dtype=np.float64)
    lookup = build_lookup(rate)
    for year in range(nyears):
        uniform = rng.random(len(rate))
        n, nonzeros = uniform_to_poisson_plus_nonzero(uniform, lookup)
        n_nonzeros = n[nonzeros]
        a = np.repeat(alpha[nonzeros], n_nonzeros)
        b = np.repeat(beta[nonzeros], n_nonzeros)
        v = np.repeat(value[nonzeros], n_nonzeros)
        losses = rng.beta(a, b)*v
        max_loss[year] = losses.max(initial=0)
        total_loss[year] = losses.sum()
    return max_loss, total_loss
def build_lookup(rate, ptol=1e-15):
    """Build uniform -> poisson lookup table based on survival function.

    Does not include an entry for k if the probability of the most likely
    rate to be k is less than ptol."""
    ret = []
    for k in itertools.count():
        sf = scipy.stats.poisson.sf(k, rate)
        # include first row of lookup table no matter what
        if sf.max() < ptol and k != 0:
            break
        ret.append(sf)
    return np.array(ret)
@nb.njit()
def uniform_to_poisson_plus_nonzero(numbers, lookup):
    assert numbers.shape[0] == lookup.shape[1]
    nonzeros = []
    ret = np.empty(lookup.shape[1], dtype=np.int64)
    for j in range(lookup.shape[1]):
        number = numbers[j]
        for i in range(lookup.shape[0]):
            if number < lookup[i, j]:
                continue
            else:
                break
        else:
            raise Exception("failed to find match for survival function - ptol is too big")
        ret[j] = i
        if i != 0:
            nonzeros.append(j)
    return ret, np.array(nonzeros, dtype=np.int64)
%timeit simulate_orig_answer(nyears, rates, alphas, betas, values, rng)
%timeit simulate_warren_answer(nyears, rates, alphas, betas, values, rng)
%timeit simulate_new_rng(nyears, rates, alphas, betas, values, rng)