I have these two dataframes, in the first one, I have category2 and category3 , while category1 is missing, I need to fill this for each , year, each month, each class and each region based on df2
. Note the data has several months, several months and years, only pasted an excerpt here.
Here is df1
Year | Month | Class | Region | Category2 | Category3 | Vol |
---|---|---|---|---|---|---|
2022 | 1 | AA | R1 | S1 | F1 | 1 |
2022 | 1 | AA | R1 | S2 | F1 | 3 |
2022 | 1 | AA | R2 | S1 | F1 | 3 |
2022 | 1 | AA | R2 | S2 | F1 | 4 |
2022 | 1 | AA | R3 | S1 | F2 | 4 |
2022 | 1 | AA | R4 | S1 | F2 | 12 |
2022 | 1 | AA | R4 | S2 | F2 | 4 |
2022 | 1 | AA | R5 | S1 | F2 | 10 |
2022 | 1 | AA | R5 | S2 | F2 | 1 |
2022 | 1 | AA | R6 | S1 | F2 | 1 |
2022 | 1 | AA | R7 | S1 | F2 | 7 |
2022 | 1 | AA | R7 | S2 | F2 | 2 |
2022 | 1 | AA | R8 | S1 | F2 | 70 |
2022 | 1 | AA | R8 | S2 | F2 | 2 |
2022 | 1 | AA | R8 | S1 | F1 | 5 |
2022 | 1 | AA | R8 | S1 | F2 | 2 |
2022 | 1 | AA | R8 | S2 | F1 | 10 |
2022 | 1 | AA | R8 | S2 | F2 | 1 |
2022 | 1 | AA | R9 | S1 | F1 | 3 |
2022 | 1 | AA | R9 | S1 | F2 | 5 |
2022 | 1 | AA | R9 | S2 | F1 | 3 |
Here is df2
Year | Month | Class | Region | Category1 | Category2 | Category3 | Vol |
---|---|---|---|---|---|---|---|
2022 | 1 | AA | R1 | Shift1 | S1 | F1 | 1 |
2022 | 1 | AA | R1 | Shift1 | S2 | F1 | 3 |
2022 | 1 | AA | R2 | Shift1 | S1 | F2 | 1 |
2022 | 1 | AA | R2 | Shift1 | S2 | F1 | 5 |
2022 | 1 | AA | R3 | Shift2 | S2 | F1 | 4 |
2022 | 1 | AA | R3 | Shift1 | S1 | F1 | 48 |
2022 | 1 | AA | R3 | Shift1 | S1 | F2 | 37 |
2022 | 1 | AA | R3 | Shift1 | S1 | F3 | 5 |
2022 | 1 | AA | R3 | Shift1 | S2 | F1 | 248 |
2022 | 1 | AA | R3 | Shift1 | S2 | F2 | 3 |
2022 | 1 | AA | R3 | Shift1 | S2 | F3 | 2 |
2022 | 1 | AA | R4 | Shift2 | S1 | F2 | 7 |
2022 | 1 | AA | R4 | Shift1 | S1 | F2 | 100 |
2022 | 1 | AA | R4 | Shift1 | S1 | F3 | 6 |
2022 | 1 | AA | R4 | Shift1 | S2 | F1 | 154 |
2022 | 1 | AA | R4 | Shift1 | S2 | F2 | 45 |
2022 | 1 | AA | R4 | Shift1 | S2 | F3 | 35 |
2022 | 1 | AA | R5 | Shift2 | S1 | F1 | 2 |
2022 | 1 | AA | R5 | Shift2 | S1 | F2 | 8 |
2022 | 1 | AA | R5 | Shift2 | S2 | F1 | 3 |
2022 | 1 | AA | R5 | Shift1 | S1 | F1 | 30 |
So I need to take the category1 distribution for the same year, same month , same class and same region from df2
and split the volume in df1
as per that
For example, if we look at row 6, the total volume 2022, January, Class AA, Region R4 Category2, S1, Category 3 F2 is 12
Year | Month | Class | Region | Category2 | Category3 | Vol |
---|---|---|---|---|---|---|
2022 | 1 | AA | R4 | S1 | F2 | 12 |
From df2
, we we see that, for the same month, year, category 2 and category3, the split between category1 is 7% and 93%
Year | Month | Class | Region | Category1 | Category2 | Category3 | Vol |
---|---|---|---|---|---|---|---|
2022 | 1 | AA | R4 | Shift2 | S1 | F2 | 7 |
2022 | 1 | AA | R4 | Shift1 | S1 | F2 | 100 |
I need to then regernate the rows in table 1 so that the 12 is split in this this manner with a new category1 column
so this
Year | Month | Class | Region | Category2 | Category3 | Vol |
---|---|---|---|---|---|---|
2022 | 1 | AA | R4 | S1 | F2 | 12 |
would become this
Year | Month | Class | Region | Category1 | Category2 | Category3 | Vol |
---|---|---|---|---|---|---|---|
2022 | 1 | AA | R4 | Shift1 | S1 | F2 | 1 |
2022 | 1 | AA | R4 | Shift2 | S1 | F2 | 11 |
I also need to round it so that the individual values I whole numbers without exceeding the total of 12.
I have no idea how to do this in python though I suspect I would have to use groupby
for part of it
I believe this will solve your problem:
df2_copy=df2.copy()
group_cols = ['Year', 'Month', 'Class', 'Region', 'Category2', 'Category3']
df2_copy['Vol'] = df2_copy.groupby(group_cols)['Vol'].transform(lambda x: x / x.sum()) #apply your function to df2 as a pre calculation
#only get rows which needed split
df_not_equal_1 = df2_copy[df2_copy['Vol'] != 1]
df_merged=df1.merge(df_not_equal_1,on= ['Year', 'Month', 'Class', 'Region', 'Category2', 'Category3'],how='right')
df_merged['Vol_Multiplied'] = df_merged['Vol_x'] * df_merged['Vol_y']
df_merged=df_merged.drop(['Vol_x','Vol_y'],axis=1)
print(df_merged)
Result:
Year Month Class Region Category2 Category3 Category1 Vol_Multiplied
0 2022 1 AA R4 S1 F2 Shift1 0.785047
1 2022 1 AA R4 S1 F2 Shift2 11.214953
If it is what you want to achive, please update your question since it is very complex, without bici.sancta's fair question it was difficult to understand. In addition you should add how to round result of vols since 11-1 seperation doesn't seem mathematically correct. I hope it helped,