python pandas sample-data

Create a Pandas dataframe of normal estimates based on varying row requirements


I get that this will create an array containing a single sample:

samples = np.random.normal(loc=df_avgs['AVERAGE'][region], scale=df_avgs['STDEV'][region], size=1)

But I want to create a sample for each row, based on a condition. For instance, I have one df of means and standard deviations and another df of conditions.

df_avgs

REGION AVERAGE STDEV
0 -1.61 7.75
1 2.87 8.38
2 3.61 7.61
3 -10.26 9.19

df_conditions

ID REGION_NAME
0 Region 0
1 Region 3
2 Region 2
3 Region 1
4 Region 1
5 Region 2
6 Region 3

How do I create a df with len(df_conditions) rows, or just add a column to df_conditions, with samples based on each row's region?


Solution

  • IIUC, you can merge the two dataframes and then assign the values using a list comprehension over a zip of the AVERAGE and STDEV columns:

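    import numpy as np
    
    # Extract the numeric region id from REGION_NAME, then look up each row's AVERAGE and STDEV.
    # how='left' keeps every row of df_conditions, in its original order.
    df_zip = df_conditions.assign(REGION=df_conditions['REGION_NAME'].str.extract(r'(\d+)', expand=False).astype(int)).merge(df_avgs, on='REGION', how='left')
    
    # Draw one normal sample per row from that row's mean and standard deviation.
    df_conditions['SAMPLES'] = [np.random.normal(loc=l, scale=s, size=1)[0] for l, s in zip(df_zip['AVERAGE'], df_zip['STDEV'])]
    
    print(df_conditions)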
    

    Output:

       ID REGION_NAME    SAMPLES
    0   0    Region 0  -2.475624
    1   1    Region 3  -7.157439
    2   2    Region 2  -4.563650
    3   3    Region 1  -2.199240
    4   4    Region 1   5.221416
    5   5    Region 2   7.175620
    6   6    Region 3 -22.775366
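
A possible alternative to the list comprehension: NumPy's normal sampler accepts array-like loc and scale, so all rows can be drawn in a single call. This is a minimal sketch, not the original answer's method. It assumes REGION is an ordinary column of df_avgs and that every REGION_NAME contains its numeric id. The names region, params, loc, scale and rng, and the seed value, are illustrative choices.

```python
import numpy as np
import pandas as pd

# Data copied from the question; REGION is assumed to be an ordinary column.
df_avgs = pd.DataFrame({
    'REGION': [0, 1, 2, 3],
    'AVERAGE': [-1.61, 2.87, 3.61, -10.26],
    'STDEV': [7.75, 8.38, 7.61, 9.19],
})
df_conditions = pd.DataFrame({
    'ID': range(7),
    'REGION_NAME': ['Region 0', 'Region 3', 'Region 2', 'Region 1',
                    'Region 1', 'Region 2', 'Region 3'],
})

# Numeric region id per row, parsed from the first run of digits in REGION_NAME.
region = df_conditions['REGION_NAME'].str.extract(r'(\d+)', expand=False).astype(int)

# Per-row mean and standard deviation, looked up by region id.
params = df_avgs.set_index('REGION')
loc = region.map(params['AVERAGE']).to_numpy()
scale = region.map(params['STDEV']).to_numpy()

# loc and scale are arrays of equal length, so normal() returns one draw per row.
rng = np.random.default_rng(0)  # seed 0 is an arbitrary choice, used only for reproducibility
df_conditions['SAMPLES'] = rng.normal(loc=loc, scale=scale)

print(df_conditions)
```

Because the lookup uses map on the region id rather than a merge, the result is aligned with df_conditions row by row, and a region missing from df_avgs would show up as NaN in SAMPLES instead of silently dropping the row.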