I have the following dataframe:
year | pop0 | pop1 | city0 | city1 |
---|---|---|---|---|
2019 | 20 | 40 | Malibu | NYC |
2018 | 8 | 60 | Sydney | Dublin |
2018 | 36 | 23 | NYC | Malibu |
2020 | 17 | 44 | Malibu | NYC |
2019 | 5 | 55 | Sydney | Dublin |
I would like to calculate the weighted average of the populations for each city pair and add it as a new column. For example, the w_mean for Malibu / NYC = (23+20+17)/(36+40+44) = 0.5, and likewise for Sydney / Dublin = (8+5)/(60+55) ≈ 0.113.
Following is the desired output:
year | pop0 | pop1 | city0 | city1 | w_mean |
---|---|---|---|---|---|
2018 | 23 | 36 | Malibu | NYC | 0.5 |
2019 | 20 | 40 | Malibu | NYC | 0.5 |
2020 | 17 | 44 | Malibu | NYC | 0.5 |
2018 | 8 | 60 | Sydney | Dublin | 0.113 |
2019 | 5 | 55 | Sydney | Dublin | 0.113 |
I already sorted the dataframe by its columns, but I am having trouble swapping the third row from NYC/Malibu to Malibu/NYC together with its populations. Besides that, I can only calculate the w_mean for each row, not for each group. I tried groupby().mean() but did not get any useful output.
Current code:
import pandas as pd
data = pd.DataFrame({
    'year': ["2019", "2018", "2018", "2020", "2019"],
    'pop0': [20, 8, 36, 17, 5],
    'pop1': [40, 60, 23, 44, 55],
    'city0': ['Malibu', 'Sydney', 'NYC', 'Malibu', 'Sydney'],
    'city1': ['NYC', 'Dublin', 'Malibu', 'NYC', 'Dublin']
})
new = data.sort_values(by=['city0', 'city1'])
new['w_mean'] = new.apply(lambda row: row.pop0 / row.pop1, axis=1)
print(new)
What you can do is create a (city, population) tuple for each of the two cities in a row, put the two tuples in a list, and sort that list. Doing this for every row lets you extract the new cities and populations (sorted alphabetically by city). This can be done as follows:
# For each row, build the two (city, pop) tuples and sort them alphabetically by city
cities = [sorted([(e[0], e[1]), (e[2], e[3])]) for e in data[['city0', 'pop0', 'city1', 'pop1']].values]
# The first tuple of each pair becomes city0/pop0, the second becomes city1/pop1
data[['city0', 'pop0']] = [e[0] for e in cities]
data[['city1', 'pop1']] = [e[1] for e in cities]
Resulting dataframe:
year pop0 pop1 city0 city1
0 2019 20 40 Malibu NYC
1 2018 60 8 Dublin Sydney
2 2018 23 36 Malibu NYC
3 2020 17 44 Malibu NYC
4 2019 55 5 Dublin Sydney
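As a side note, the same swap can also be sketched with a boolean mask instead of tuples; this is only an alternative under the same column names, not part of the steps above:
swap = data['city0'] > data['city1']  # rows where the cities are out of alphabetical order
# .values prevents pandas from realigning the columns, which would undo the swap
data.loc[swap, ['city0', 'city1']] = data.loc[swap, ['city1', 'city0']].values
data.loc[swap, ['pop0', 'pop1']] = data.loc[swap, ['pop1', 'pop0']].values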
Now the w_mean column can be created using groupby and transform to compute the two group sums, and then dividing them as follows:
data[['pop0_sum', 'pop1_sum']] = data.groupby(['city0', 'city1'])[['pop0', 'pop1']].transform('sum')
data['w_mean'] = data['pop0_sum'] / data['pop1_sum']
Result:
year pop0 pop1 city0 city1 pop0_sum pop1_sum w_mean
0 2019 20 40 Malibu NYC 60 120 0.500000
1 2018 60 8 Dublin Sydney 115 13 8.846154
2 2018 23 36 Malibu NYC 60 120 0.500000
3 2020 17 44 Malibu NYC 60 120 0.500000
4 2019 55 5 Dublin Sydney 115 13 8.846154
Any extra columns can now be dropped.
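For example, data.drop(columns=['pop0_sum', 'pop1_sum']) removes the helper columns again. Alternatively, here is a minimal sketch that avoids creating them in the first place (the name sums is only an illustration):
sums = data.groupby(['city0', 'city1'])[['pop0', 'pop1']].transform('sum')  # per-group sums, aligned to the original rows
data['w_mean'] = sums['pop0'] / sums['pop1']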
If the resulting w_mean column should always be at most one, then the last division can be done as follows instead (this requires import numpy as np):
data['w_mean'] = np.where(data['pop0_sum'] > data['pop1_sum'], data['pop1_sum'] / data['pop0_sum'], data['pop0_sum'] / data['pop1_sum'])
This will give 0.5 for the Malibu & NYC pair and 0.113043 for Dublin & Sydney.
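Equivalently, as a small sketch of the same idea, you can divide the smaller sum by the larger one directly, which yields the same 0.5 and 0.113043 values:
data['w_mean'] = np.minimum(data['pop0_sum'], data['pop1_sum']) / np.maximum(data['pop0_sum'], data['pop1_sum'])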