python, gamma-distribution, gamma, imbalanced-data

How to take a more balanced data sample in Python


I have a dataframe with normalized percentage info. E.g.:

    wordCount  number  Percent
    2.0          1282  0.267345
    1.0           888  0.185213
    3.0          1124  0.170791
    4.0          1250  0.152877
    5.0           554  0.084864
    6.0           333  0.058904
    7.0           160  0.024290
    8.0           111  0.016851

The percentages sum to 1. The dataframe has 6000 entries, and I wish to take a sample of 2000 from it. The 2000 samples should be as balanced as possible.

They should include as much as possible of the low-percentage data and as little as possible of the high-percentage data.

I don't know how to do it.

E.g. the 2000 samples would include all the data from wordCount 8.0 and only a minimum from wordCount 2.0.

When I plot the gamma distribution, the line should be as flat as possible.


Solution

  • First you need to calculate how many samples to take from each word count. Assuming 'wc' is a dataframe with columns 'wordCount' and 'number':

     options = len(wc)
     remaining = 2000
     wc = wc.sort_values('number').reset_index(drop=True)
     wc['how many'] = 0
     for i in range(options):
         # give this group an even share of what's left, capped by its size
         wc.loc[i, 'how many'] = min(wc.loc[i, 'number'], remaining // (options - i))
         remaining -= wc.loc[i, 'how many']
    
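
As a worked example (a sketch using the counts from the question's table and a budget of 2000), the allocation fills the small groups completely and caps the big ones at an even share:

```python
import pandas as pd

# counts from the question's table
wc = pd.DataFrame({
    'wordCount': [2.0, 1.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'number':    [1282, 888, 1124, 1250, 554, 333, 160, 111],
})

options = len(wc)
remaining = 2000
wc = wc.sort_values('number').reset_index(drop=True)
wc['how many'] = 0
for i in range(options):
    # an even share of the remaining budget, capped by the group's size
    wc.loc[i, 'how many'] = min(wc.loc[i, 'number'], remaining // (options - i))
    remaining -= wc.loc[i, 'how many']

# word counts 8.0 and 7.0 (111 and 160 rows) are taken in full;
# the larger groups end up capped near an even share of the budget
print(wc)
```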

    The 'how many' column now holds the number you want to sample from each wordCount. Then on your dataframe, let's say named 'data', which should have a matching 'wordCount' column, you can sample the number you need with:

    import pandas as pd

    parts = []
    for i in data['wordCount'].unique():
        part_data = data[data['wordCount'] == i]
        # look up this word count's quota and sample that many rows
        n = wc.loc[wc['wordCount'] == i, 'how many'].iloc[0]
        parts.append(part_data.sample(n))
    all_samples = pd.concat(parts)
    

    In the end, 'all_samples' should have the 2000 samples with the distribution you asked for.
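
    To see the whole procedure end to end, here is a self-contained sanity check on a synthetic 'data' frame built from the question's counts (the rows are dummies; only the 'wordCount' column matters):

```python
import numpy as np
import pandas as pd

counts = {2.0: 1282, 1.0: 888, 3.0: 1124, 4.0: 1250,
          5.0: 554, 6.0: 333, 7.0: 160, 8.0: 111}

# one synthetic row per entry of the original dataframe
data = pd.DataFrame({'wordCount': np.repeat(list(counts), list(counts.values()))})

# allocation step
wc = pd.DataFrame({'wordCount': list(counts), 'number': list(counts.values())})
options, remaining = len(wc), 2000
wc = wc.sort_values('number').reset_index(drop=True)
wc['how many'] = 0
for i in range(options):
    wc.loc[i, 'how many'] = min(wc.loc[i, 'number'], remaining // (options - i))
    remaining -= wc.loc[i, 'how many']

# sampling step
parts = []
for i in data['wordCount'].unique():
    n = wc.loc[wc['wordCount'] == i, 'how many'].iloc[0]
    parts.append(data[data['wordCount'] == i].sample(n, random_state=0))
all_samples = pd.concat(parts)

print(len(all_samples))                         # 2000
print((all_samples['wordCount'] == 8.0).sum())  # all 111 rows of wordCount 8.0
```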

    By the way: looping over dataframe rows is generally a very bad idea, and this could have been vectorized, but since it's just 8 rows, I allowed myself.
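
    For reference, one way the per-group sampling could be written without the explicit concat loop is a groupby-apply (a sketch on toy stand-ins for 'data' and 'wc'; 'quota' is a hypothetical lookup mapping each wordCount to its 'how many' value):

```python
import pandas as pd

# toy stand-ins for 'data' and the allocation table 'wc'
data = pd.DataFrame({'wordCount': [1.0] * 5 + [2.0] * 3, 'x': range(8)})
wc = pd.DataFrame({'wordCount': [1.0, 2.0], 'how many': [2, 3]})

# per-group quota lookup, then sample each group in one pass
quota = wc.set_index('wordCount')['how many']
all_samples = (
    data.groupby('wordCount', group_keys=False)
        .apply(lambda g: g.sample(quota[g.name], random_state=0))
)
print(len(all_samples))  # 5
```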