I have a function that takes a dataframe of articles about life expectancy in different regions and countries. I want to compute each region's share of all articles, and, within each region, the share of articles about males, females, and both sexes. My question is: how can I replace the for loop in calc_proportion and still build this small dataframe? The function currently takes all the unique regions in the dataframe and computes the proportions for each of them.
I want to have this kind of dataframe from function calc_proportion.
def calc_proportion(df):
    proportions = pd.DataFrame(columns=['Region', 'Proportion_of_all_articles', 'Proportion_male_articles', 'Proportion_female_articles', 'Proportion_bs_articles'])
    Regions = df.Region.unique()
    for region in Regions:
        a = f"{df.loc[df['Region'] == region].shape[0] / df.shape[0] : .0%}"
        b = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Male')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        c = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Female')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        d = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Both sexes')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        proportions.loc[len(proportions)] = [region, a, b, c, d]
    return proportions

calc_proportion(df)
So I want to get this small dataframe of proportions (as 'out') without using a for loop in the function.
Initial data:
import pandas as pd
import numpy as np
np.random.seed(0)  # for reproducibility
regions = ['Africa', 'Americas', 'Eastern Mediterranean', 'Europe',
           'South_East Asia']
sexes = ['Male', 'Female', 'Both sexes']
data = {'Region': np.random.choice(regions, 15),
        'Sex': np.random.choice(sexes, 15)}
df = pd.DataFrame(data)
df
                   Region         Sex
0         South_East Asia      Female
1                  Africa      Female
2                  Europe      Female
3                  Europe      Female
4                  Europe        Male
5                Americas      Female
6                  Europe        Male
7   Eastern Mediterranean        Male
8         South_East Asia      Female
9                  Africa  Both sexes
10                 Africa        Male
11        South_East Asia  Both sexes
12  Eastern Mediterranean        Male
13               Americas      Female
14                 Africa      Female
Here's one approach:
- Use df.groupby on "Region" and apply groupby.value_counts with the normalize parameter set to True to get the distribution per region.
- Use df.unstack to pivot the second index level (with the "sexes") into columns.
- Get each region's share of all articles by applying value_counts (Series.value_counts), again with normalize=True, to df["Region"]. We use df.join to join the two results.
- Use df.fillna to fill NaN values with 0.
- Use df.rename to change the column names.
- Put the columns in the desired order with df.loc, and reset the index with df.reset_index.

Code
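# dict for renaming col names at end
cols_rename = {'Region': 'Proportion_of_all_articles',
               'Male': 'Proportion_male_articles',
               'Female': 'Proportion_female_articles',
               'Both sexes': 'Proportion_bs_articles'}
out = (df.groupby('Region')['Sex']
       .value_counts(normalize=True)    # share of each sex within a region
       .unstack('Sex')                  # one column per sex
       .join(
           # name the Series 'Region' explicitly: pandas >= 2.0 calls it
           # 'proportion', older versions name it after the column
           df['Region'].value_counts(normalize=True).rename('Region')
       )
       .fillna(0)                       # sexes missing in a region become 0
       .rename(columns=cols_rename)
       .loc[:, list(cols_rename.values())]  # desired column order
       .reset_index(drop=False)
       )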
Result
out
                  Region  Proportion_of_all_articles  \
0                 Africa                    0.266667
1               Americas                    0.133333
2  Eastern Mediterranean                    0.133333
3                 Europe                    0.266667
4        South_East Asia                    0.200000

   Proportion_male_articles  Proportion_female_articles  \
0                      0.25                    0.500000
1                      0.00                    1.000000
2                      1.00                    0.000000
3                      0.50                    0.500000
4                      0.00                    0.666667

   Proportion_bs_articles
0                0.250000
1                0.000000
2                0.000000
3                0.000000
4                0.333333
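
For comparison, the same kind of table can probably also be built with pd.crosstab, which row-normalises counts directly when normalize='index' is passed. This is only a sketch of an alternative, not the approach described above; the six-row sample frame and the names by_sex, share and alt are made up for illustration:

```python
import pandas as pd

# Hypothetical mini sample with the same two columns as the question's df.
df = pd.DataFrame({
    'Region': ['Africa', 'Africa', 'Africa', 'Africa', 'Europe', 'Europe'],
    'Sex': ['Male', 'Female', 'Female', 'Both sexes', 'Male', 'Female'],
})

# Row-normalised cross-tabulation: share of each Sex within each Region.
by_sex = pd.crosstab(df['Region'], df['Sex'], normalize='index')

# Fix the column order; a sex that never occurs at all becomes a column of 0.
by_sex = by_sex.reindex(columns=['Male', 'Female', 'Both sexes'], fill_value=0.0)
by_sex.columns = ['Proportion_male_articles',
                  'Proportion_female_articles',
                  'Proportion_bs_articles']

# Share of each Region among all rows.
share = df['Region'].value_counts(normalize=True)

# insert() aligns the Series on the Region index, so row order stays intact.
by_sex.insert(0, 'Proportion_of_all_articles', share)

alt = by_sex.reset_index()
print(alt)
```

Because crosstab reports a count of 0 for combinations that do not occur within a region, no fillna step is needed in this variant.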
Formatted result
Seeing that you are working in Jupyter Notebook, I'd suggest using df.style.format to print the result with the floats as percentages:
out.style.format({
    col: lambda x: "{: .0f}%".format(x*100) for col in out.columns if 'Proportion' in col
})
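
If the percentages are needed as actual strings in the dataframe (as in the original calc_proportion) rather than only as a notebook display style, one option is to map a format string over the proportion columns. A minimal sketch, assuming a frame shaped like out; the two-row stand-in frame and the name formatted are made up for illustration, and "{:.0%}" is used without the leading space that ": .0%" in the question produces:

```python
import pandas as pd

# Hypothetical two-row stand-in for the 'out' frame.
out = pd.DataFrame({
    'Region': ['Africa', 'Americas'],
    'Proportion_of_all_articles': [0.266667, 0.133333],
    'Proportion_male_articles': [0.25, 0.0],
    'Proportion_female_articles': [0.5, 1.0],
    'Proportion_bs_articles': [0.25, 0.0],
})

# Columns holding proportions, selected the same way as in the style call.
prop_cols = [col for col in out.columns if 'Proportion' in col]

# Replace each float column with its percentage string, e.g. 0.25 -> '25%'.
formatted = out.assign(
    **{col: out[col].map('{:.0%}'.format) for col in prop_cols}
)
print(formatted)
```

Note that the formatted columns hold strings, so any further arithmetic should be done on the numeric frame before this step.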