pythonpandasseaborncountplot

countplot with normalized y axis per group


I was wondering if it is possible to create a Seaborn count plot, but instead of actual counts on the y-axis, show the relative frequency (percentage) within its group (as specified with the hue parameter).

I sort of fixed this with the following approach, but I can't imagine this is the easiest approach:

# Plot percentage of occupation per income class
grouped = df.groupby(['income'], sort=False)
occupation_counts = grouped['occupation'].value_counts(normalize=True, sort=False)

occupation_data = [
    {'occupation': occupation, 'income': income, 'percentage': percentage*100} for 
    (income, occupation), percentage in dict(occupation_counts).items()
]

df_occupation = pd.DataFrame(occupation_data)

p = sns.barplot(x="occupation", y="percentage", hue="income", data=df_occupation)
_ = plt.setp(p.get_xticklabels(), rotation=90)  # Rotate labels

Result:

Percentage plot with seaborn

I'm using the well known adult data set from the UCI machine learning repository. The pandas dataframe is created like this:

# Read the adult dataset
df = pd.read_csv(
    "data/adult.data",
    engine='c',
    lineterminator='\n',

    names=['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hours_per_week',
           'native_country', 'income'],
    header=None,
    skipinitialspace=True,
    na_values="?"
)

This question is sort of related, but does not make use of the hue parameter. And in my case I cannot just change the labels on the y-axis, because the height of the bar must depend on the group.


Solution

  • With newer versions of seaborn you can do following:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    sns.set(color_codes=True)
    
    df = sns.load_dataset('titanic')
    df.head()
    
    x,y = 'class', 'survived'
    
    (df
    .groupby(x)[y]
    .value_counts(normalize=True)
    .mul(100)
    .rename('percent')
    .reset_index()
    .pipe((sns.catplot,'data'), x=x,y='percent',hue=y,kind='bar'))
    
    
    

    output

    enter image description here

    Update: Also show percentages on top of barplots

    If you also want percentages, you can do following:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    
    df = sns.load_dataset('titanic')
    df.head()
    
    x,y = 'class', 'survived'
    
    df1 = df.groupby(x)[y].value_counts(normalize=True)
    df1 = df1.mul(100)
    df1 = df1.rename('percent').reset_index()
    
    g = sns.catplot(x=x,y='percent',hue=y,kind='bar',data=df1)
    g.ax.set_ylim(0,100)
    
    for p in g.ax.patches:
        txt = str(p.get_height().round(2)) + '%'
        txt_x = p.get_x() 
        txt_y = p.get_height()
        g.ax.text(txt_x,txt_y,txt)
    

    enter image description here