python pandas sampling

Pandas: Sampling from a DataFrame according to a target distribution


I have a pandas DataFrame containing a dataset D of instances drawn from a distribution x; x may be a uniform distribution, for example.

Now I want to draw n samples from D according to some new target_distribution, such as a Gaussian, that is in general different from x. How can I do this efficiently?

Right now, I draw a value x from the target distribution, subset D to the instances within x ± eps, and sample one instance from that subset. But this gets quite slow as the datasets grow. People must have come up with a better solution. Or maybe the approach is already fine and just needs a more efficient implementation?

I could split x into strata, which would be faster, but is there a solution without this?

My current code, which works fine but is slow (about 1 min to draw 30k samples from a dataset of 100k, and my real data is more like 200k/700k):

import numpy as np
import pandas as pd
import numpy.random as rnd
from matplotlib import pyplot as plt
from tqdm import tqdm

n_target = 30000
n_dataset = 100000

x_target_distribution = rnd.normal(size=n_target)
# In reality this would be x_target_distribution = my_dataset["x"].sample(n_target, replace=True)

df = pd.DataFrame({
    'instances': np.arange(n_dataset),
    'x': rnd.uniform(-5, 5, size=n_dataset)
    })

plt.hist(df["x"], histtype="step", density=True)
plt.hist(x_target_distribution, histtype="step", density=True)

def sample_instance_with_x(x, eps=0.2):
    try:
        return df.loc[abs(df["x"] - x) < eps].sample(1)
    except ValueError: # fallback if no instance possible
        return df.sample(1)

df_sampled_ = [sample_instance_with_x(x) for x in tqdm(x_target_distribution)]
df_sampled = pd.concat(df_sampled_)

plt.hist(df_sampled["x"], histtype="step", density=True)
plt.hist(x_target_distribution, histtype="step", density=True)

Solution

  • Rather than generating new points and finding the closest neighbor in df.x, define the probability with which each point should be sampled under your target distribution, then draw with np.random.choice. For a Gaussian target distribution like this, a million points are sampled from df.x in a second or so:

    x = np.sort(df.x)
    f_x = np.gradient(x)*np.exp(-x**2/2)
    sample_probs = f_x/np.sum(f_x)
    samples = np.random.choice(x, p=sample_probs, size=1000000)
    

    sample_probs is the key quantity: it can be joined back to the DataFrame or passed as the weights argument to df.sample. Its entries are in sorted-x order, so sort the DataFrame the same way first, e.g.:

    # sample df rows without replacement
    df_samples = df.sort_values("x").sample(
        n=1000,
        weights=sample_probs,
        replace=False,
    )
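
    To keep the weights attached to their rows, they can also be stored as a column. The following is only a minimal sketch: the toy DataFrame mirrors the one in the question, the column name sample_prob and the fixed seeds are my own choices, and replace=True is assumed so that the draw behaves like the np.random.choice call above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

# Toy stand-in for the question's DataFrame (uniform x on [-5, 5]).
df = pd.DataFrame({
    "instances": np.arange(100_000),
    "x": rng.uniform(-5, 5, size=100_000),
})

# Sort by x so that np.gradient sees monotonically increasing values.
df_sorted = df.sort_values("x")
x = df_sorted["x"].to_numpy()

# Weight = local spacing * target density (standard normal, unnormalised).
weights = np.gradient(x) * np.exp(-x**2 / 2)

# Store the normalised weights as a column (hypothetical name "sample_prob").
df_sorted = df_sorted.assign(sample_prob=weights / weights.sum())

# Draw whole rows, with replacement, using the stored weights.
df_sampled = df_sorted.sample(
    n=30_000, weights="sample_prob", replace=True, random_state=0
)
```

    df_sampled keeps every original column, so the instances values of the drawn rows are available directly.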
    

    The result of plt.hist(samples, bins=100, density=True):

    [Figure: histogram of the samples, with the delta_x correction applied]

    We can also try the reverse case, Gaussian-distributed x with a uniform target distribution:

    x = np.sort(np.random.normal(size=100000))
    f_x = np.gradient(x)*np.ones(len(x))
    sample_probs = f_x/np.sum(f_x)
    samples = np.random.choice(x, p=sample_probs, size=1000000)
    

    [Figure: histogram of samples drawn toward a uniform target from Gaussian-distributed points]

    The tails would look more uniform with wider bins; their raggedness is an artifact of D being sparse at the edges.

    comments

    This approach computes the probability of sampling any x_i as the product of the span of x associated with x_i (delta_x, estimated here with np.gradient) and the target probability density rho in its neighborhood:

    prob(x_i) ~ delta_x*rho(x_i)

    A more robust treatment would be to integrate rho over the span delta_x associated with each x_i. Also note that ignoring the delta_x term introduces error, as can be seen below. The error would be much worse if the original x_i weren't approximately uniformly sampled:

    [Figure: histogram of the samples with the delta_x term ignored]
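
    The integration mentioned above can be sketched as follows for a standard normal target, where the integral of rho over each cell is a difference of CDF values. This is an illustration under my own assumptions, not a definitive implementation: cell boundaries are taken as the midpoints between neighbouring sorted points, the outermost cells end at the smallest and largest x_i, and the CDF is built from math.erf to avoid extra dependencies.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

# Source points: uniform on [-5, 5], as in the question, sorted ascending.
x = np.sort(rng.uniform(-5, 5, size=100_000))

# Cell edges: midpoints between neighbours; the outer edges are the
# extreme points themselves (an assumption, other choices are possible).
edges = np.concatenate(([x[0]], (x[:-1] + x[1:]) / 2, [x[-1]]))

# Standard normal CDF evaluated at every edge.
erf = np.vectorize(math.erf)
cdf = 0.5 * (1.0 + erf(edges / math.sqrt(2.0)))

# Probability mass of the target inside each cell = integral of rho there.
cell_mass = np.clip(np.diff(cdf), 0.0, None)
sample_probs = cell_mass / cell_mass.sum()

samples = rng.choice(x, p=sample_probs, size=1_000_000)
```

    For closely spaced points this gives nearly the same weights as np.gradient(x) * rho(x); the difference matters mainly where the points are sparse and rho changes noticeably within a cell.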