pythonpandasmissing-data

How to generate missing data in one dataframe based on a distribution from another dataframe


I have these two dataframes, in the first one, I have category2 and category3 , while category1 is missing, I need to fill this for each , year, each month, each class and each region based on df2. Note the data has several months, several months and years, only pasted an excerpt here.

Here is df1

Year Month Class Region Category2 Category3 Vol
2022 1 AA R1 S1 F1 1
2022 1 AA R1 S2 F1 3
2022 1 AA R2 S1 F1 3
2022 1 AA R2 S2 F1 4
2022 1 AA R3 S1 F2 4
2022 1 AA R4 S1 F2 12
2022 1 AA R4 S2 F2 4
2022 1 AA R5 S1 F2 10
2022 1 AA R5 S2 F2 1
2022 1 AA R6 S1 F2 1
2022 1 AA R7 S1 F2 7
2022 1 AA R7 S2 F2 2
2022 1 AA R8 S1 F2 70
2022 1 AA R8 S2 F2 2
2022 1 AA R8 S1 F1 5
2022 1 AA R8 S1 F2 2
2022 1 AA R8 S2 F1 10
2022 1 AA R8 S2 F2 1
2022 1 AA R9 S1 F1 3
2022 1 AA R9 S1 F2 5
2022 1 AA R9 S2 F1 3

Here is df2

Year Month Class Region Category1 Category2 Category3 Vol
2022 1 AA R1 Shift1 S1 F1 1
2022 1 AA R1 Shift1 S2 F1 3
2022 1 AA R2 Shift1 S1 F2 1
2022 1 AA R2 Shift1 S2 F1 5
2022 1 AA R3 Shift2 S2 F1 4
2022 1 AA R3 Shift1 S1 F1 48
2022 1 AA R3 Shift1 S1 F2 37
2022 1 AA R3 Shift1 S1 F3 5
2022 1 AA R3 Shift1 S2 F1 248
2022 1 AA R3 Shift1 S2 F2 3
2022 1 AA R3 Shift1 S2 F3 2
2022 1 AA R4 Shift2 S1 F2 7
2022 1 AA R4 Shift1 S1 F2 100
2022 1 AA R4 Shift1 S1 F3 6
2022 1 AA R4 Shift1 S2 F1 154
2022 1 AA R4 Shift1 S2 F2 45
2022 1 AA R4 Shift1 S2 F3 35
2022 1 AA R5 Shift2 S1 F1 2
2022 1 AA R5 Shift2 S1 F2 8
2022 1 AA R5 Shift2 S2 F1 3
2022 1 AA R5 Shift1 S1 F1 30

So I need to take the category1 distribution for the same year, same month , same class and same region from df2 and split the volume in df1 as per that

For example, if we look at row 6, the total volume 2022, January, Class AA, Region R4 Category2, S1, Category 3 F2 is 12

Year Month Class Region Category2 Category3 Vol
2022 1 AA R4 S1 F2 12

From df2 , we we see that, for the same month, year, category 2 and category3, the split between category1 is 7% and 93%

Year Month Class Region Category1 Category2 Category3 Vol
2022 1 AA R4 Shift2 S1 F2 7
2022 1 AA R4 Shift1 S1 F2 100

I need to then regernate the rows in table 1 so that the 12 is split in this this manner with a new category1 column

so this

Year Month Class Region Category2 Category3 Vol
2022 1 AA R4 S1 F2 12

would become this

Year Month Class Region Category1 Category2 Category3 Vol
2022 1 AA R4 Shift1 S1 F2 1
2022 1 AA R4 Shift2 S1 F2 11

I also need to round it so that the individual values I whole numbers without exceeding the total of 12.

I have no idea how to do this in python though I suspect I would have to use groupby for part of it


Solution

  • I believe this will solve your problem:

    df2_copy=df2.copy()
    
    group_cols = ['Year', 'Month', 'Class', 'Region', 'Category2', 'Category3']
    df2_copy['Vol'] = df2_copy.groupby(group_cols)['Vol'].transform(lambda x: x / x.sum()) #apply your function to df2 as a pre calculation
    #only get rows which needed split
    df_not_equal_1 = df2_copy[df2_copy['Vol'] != 1] 
    
    df_merged=df1.merge(df_not_equal_1,on= ['Year', 'Month', 'Class', 'Region', 'Category2', 'Category3'],how='right')
    df_merged['Vol_Multiplied'] = df_merged['Vol_x'] * df_merged['Vol_y']
    df_merged=df_merged.drop(['Vol_x','Vol_y'],axis=1)
    print(df_merged)
    

    Result:

       Year  Month Class Region Category2 Category3 Category1  Vol_Multiplied
    0  2022      1    AA     R4        S1        F2    Shift1        0.785047
    1  2022      1    AA     R4        S1        F2    Shift2       11.214953
    

    If it is what you want to achive, please update your question since it is very complex, without bici.sancta's fair question it was difficult to understand. In addition you should add how to round result of vols since 11-1 seperation doesn't seem mathematically correct. I hope it helped,