Tags: python, pandas, numpy, for-loop, proportions

How to replace a for loop in a pandas DataFrame?


I have a function that takes a DataFrame of articles about life expectancy in different regions and countries. For each region, I want to compute its share of all articles, and within each region the share of articles about males, females, and both sexes. The function calc_proportion below takes all the unique regions in the DataFrame and computes these proportions for each of them in a for loop. My question is: how can I replace the for loop and still build the same small DataFrame?

The function currently looks like this, and its result is the kind of DataFrame I want:

def calc_proportion(df):
    proportions = pd.DataFrame(columns=['Region', 'Proportion_of_all_articles', 'Proportion_male_articles', 'Proportion_female_articles', 'Proportion_bs_articles'])
    Regions = df.Region.unique()
    for region in Regions:
        a = f"{df.loc[df['Region'] == region].shape[0] / df.shape[0] : .0%}"
        b = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Male')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        c = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Female')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        d = f"{df.loc[(df['Region'] == region) & (df['Sex'] == 'Both sexes')].shape[0] / df.loc[df['Region'] == region].shape[0] : .0%}"
        proportions.loc[len(proportions)] = [region, a, b, c, d]
    return proportions

calc_proportion(df)

Result: [image of the resulting DataFrame]

So I want to get the same small DataFrame of proportions without using a for loop in the function.

Initial data: [image of the input DataFrame]


Solution

  • Minimal reproducible example

    import pandas as pd
    import numpy as np
    
    np.random.seed(0) # for reproducibility
    regions = ['Africa', 'Americas', 'Eastern Mediterranean', 'Europe', 
               'South_East Asia']
    sexes = ['Male', 'Female', 'Both sexes']
    
    data = {'Region': np.random.choice(regions, 15),
            'Sex': np.random.choice(sexes, 15)}
    
    df = pd.DataFrame(data)
    
    df
    
                       Region         Sex
    0         South_East Asia      Female
    1                  Africa      Female
    2                  Europe      Female
    3                  Europe      Female
    4                  Europe        Male
    5                Americas      Female
    6                  Europe        Male
    7   Eastern Mediterranean        Male
    8         South_East Asia      Female
    9                  Africa  Both sexes
    10                 Africa        Male
    11        South_East Asia  Both sexes
    12  Eastern Mediterranean        Male
    13               Americas      Female
    14                 Africa      Female
    

    Here's one approach:

    Code

    # dict for renaming col names at end
    cols_rename = {'Region': 'Proportion_of_all_articles',
                   'Male': 'Proportion_male_articles',
                   'Female': 'Proportion_female_articles',
                   'Both sexes': 'Proportion_bs_articles'}
    
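    out = (df.groupby('Region')['Sex']
           .value_counts(normalize=True)   # share of each Sex within each Region
           .unstack('Sex')                 # one column per Sex
           .join(
               # share of each Region among all rows; name it 'Region' explicitly,
               # because pandas >= 2.0 names this Series 'proportion' instead
               df['Region'].value_counts(normalize=True).rename('Region')
               )
           .fillna(0)                      # Region/Sex combinations that never occur
           .rename(columns=cols_rename)
           .loc[:, list(cols_rename.values())]
           .reset_index(drop=False)
           )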
    

    Result

    out
    
                      Region  Proportion_of_all_articles  \
    0                 Africa                    0.266667   
    1               Americas                    0.133333   
    2  Eastern Mediterranean                    0.133333   
    3                 Europe                    0.266667   
    4        South_East Asia                    0.200000   
    
       Proportion_male_articles  Proportion_female_articles  \
    0                      0.25                    0.500000   
    1                      0.00                    1.000000   
    2                      1.00                    0.000000   
    3                      0.50                    0.500000   
    4                      0.00                    0.666667   
    
       Proportion_bs_articles  
    0                0.250000  
    1                0.000000  
    2                0.000000  
    3                0.000000  
    4                0.333333
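
As a cross-check on the result above, here is a minimal alternative sketch that builds the same table with pd.crosstab instead of groupby plus value_counts. It is not part of the approach above. The names by_sex, share and alt are only illustrative, and the sample rows are typed in by hand from the example DataFrame so that the snippet does not depend on the random seed.

```python
import pandas as pd

# Sample rows copied by hand from the example DataFrame above
regions = ['South_East Asia', 'Africa', 'Europe', 'Europe', 'Europe',
           'Americas', 'Europe', 'Eastern Mediterranean', 'South_East Asia',
           'Africa', 'Africa', 'South_East Asia', 'Eastern Mediterranean',
           'Americas', 'Africa']
sex = ['Female', 'Female', 'Female', 'Female', 'Male',
       'Female', 'Male', 'Male', 'Female',
       'Both sexes', 'Male', 'Both sexes', 'Male',
       'Female', 'Female']
df = pd.DataFrame({'Region': regions, 'Sex': sex})

# Row-normalised counts: share of each Sex within each Region
by_sex = pd.crosstab(df['Region'], df['Sex'], normalize='index')

# Fix the column order and keep a column even if a category never occurs
by_sex = by_sex.reindex(columns=['Male', 'Female', 'Both sexes'], fill_value=0.0)
by_sex = by_sex.rename(columns={'Male': 'Proportion_male_articles',
                                'Female': 'Proportion_female_articles',
                                'Both sexes': 'Proportion_bs_articles'})
by_sex.columns.name = None  # drop the leftover 'Sex' label on the column axis

# Share of each Region among all rows; insert() aligns it on the Region index
share = df['Region'].value_counts(normalize=True)
by_sex.insert(0, 'Proportion_of_all_articles', share)

alt = by_sex.reset_index()
print(alt)
```

Because crosstab already returns zeros for combinations that never occur, no fillna step is needed in this variant.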
    

    Formatted result

    Seeing that you are working in Jupyter Notebook, I'd suggest using df.style.format to print the result with the floats as percentages:

    out.style.format({
        col: lambda x: "{: .0f}%".format(x*100) for col in out.columns if 'Proportion' in col
    })
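
Styler output only renders in a notebook. If the percentages are needed as actual string values, as in the original calc_proportion, a sketch along these lines may work instead. The small out frame here is a hand-typed stand-in using the first two rows of the result above, and prop_cols and formatted are illustrative names.

```python
import pandas as pd

# Stand-in for `out`: first two rows of the result shown earlier
out = pd.DataFrame({
    'Region': ['Africa', 'Americas'],
    'Proportion_of_all_articles': [4 / 15, 2 / 15],
    'Proportion_male_articles': [0.25, 0.0],
    'Proportion_female_articles': [0.5, 1.0],
    'Proportion_bs_articles': [0.25, 0.0],
})

# Columns holding proportions, selected the same way as in the Styler call
prop_cols = [col for col in out.columns if 'Proportion' in col]

# Replace each float column with its percentage string; `out` itself is untouched
formatted = out.assign(**{col: out[col].map('{:.0%}'.format) for col in prop_cols})
print(formatted)
```

Note that the formatted columns hold strings, so they can no longer be used for arithmetic; keeping out as the numeric version and formatting a copy avoids that problem.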
    
