I have generated a large simulated population as a Polars dataframe built from NumPy arrays. I want to draw random samples from this population dataframe many times. However, when I do that, every sample is exactly the same. I know there must be an easy fix for this; any recommendations? I suspect the repeat function is the cause. Does anyone have ideas for how I can draw multiple independent random samples?
Here's my code:
from itertools import repeat
import numpy as np
import polars as pl
N = 1000000 # population size
samples = 1000 # number of samples
num_obs = 100 # size of each sample
# Generate population data
a = np.random.gamma(2, 2, N)
b = np.random.binomial(1, 0.6, N)
x = 0.2 * a + 0.5 * b + np.random.normal(0, 10, N)
z = 0.9 * a * b + np.random.normal(0, 10, N)
y = 0.6 * x + 0.9 * z + np.random.normal(0, 10, N)
# Store this in a population dataframe
pop_data_frame = pl.DataFrame({
    'A': a,
    'B': b,
    'X': x,
    'Z': z,
    'Y': y,
    'id': range(1, N+1)
})
# Get 1000 samples from this pop_data_frame...
#... with 100 observations in each sample.
sample_list = list(
    repeat(
        pop_data_frame.sample(n=num_obs), samples
    )
)
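With repeat(), you're calling .sample() once and repeating that one result 1000 times: the argument is evaluated before repeat() ever sees it, so repeat() only receives a single, already-drawn dataframe.
You want to call .sample() 1000 times: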
sample_list = [pop_data_frame.sample(n=num_obs) for _ in range(samples)]
Or, you could use the Polars lazy API to build a list of LazyFrames and pass it to pl.collect_all(), which should be faster because Polars can run the queries in parallel:
sample_list = pl.collect_all(
    [
        pop_data_frame.lazy().select(
            row=pl.struct(pl.all()).sample(num_obs)
        ).unnest("row")
        for _ in range(samples)
    ]
)
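If it helps to see why the repeat() version gives identical samples, here is a minimal standard-library sketch of the same behaviour. It uses a plain list instead of a dataframe, and draw() is a hypothetical stand-in for pop_data_frame.sample(n=num_obs), so treat it as an illustration rather than the Polars code itself.

```python
import random
from itertools import repeat

population = list(range(1_000))  # stand-in for the population dataframe
num_obs = 5                      # size of each sample
samples = 4                      # number of samples


def draw():
    # Hypothetical helper standing in for pop_data_frame.sample(n=num_obs)
    return random.sample(population, num_obs)


# draw() runs once here; repeat() then yields that same list object each time
repeated = list(repeat(draw(), samples))

# The comprehension calls draw() on every iteration, so each sample is drawn anew
independent = [draw() for _ in range(samples)]

# True: every element of `repeated` is the very same object
print(all(s is repeated[0] for s in repeated))
# Number of distinct samples among the independent draws
print(len({tuple(s) for s in independent}))
```

The same reasoning applies to the dataframe case: anything passed as an argument is evaluated before the function is called, so the sampling has to sit inside the loop or comprehension.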