pythonpandasdataframe

How can plot categorical stacked bar over periodic time?


I'm trying to extract a stacked bar chart over periodic time (5 years):

import pandas as pd
categorical = ["RL","CD(others)","DL","ML","ML","ML","DL","ML","DL","DL"]
year = [2014,2014,2015,2015,2016,2017,2019,2021,2022,2022]
df = pd.DataFrame({'year':year,
                   'keywords':categorical})
df

I tried relevant post1, post2, post3 to resolve the problem:

#solution1:Pivot table
df.pivot_table(index='year',
               columns='keywords',
              # values='paper_count',
               aggfunc='sum')
#df.plot(x='year', y='paper_count', kind='bar')


#solution2: groupby
# reset_index() gives a column for counting after groupby uses year and category
ctdf = (df.reset_index()
          .groupby(['year'], as_index=False)
          .count()
          # rename isn't strictly necessary here; it's just for readability
          .rename(columns={'index':'paper_count'})
       )
ctdf.plot(x='year', y='paper_count', kind='bar')

At the end, I couldn't figure out how can plot this periodically by counting every 5 yrs:

2000-2005, 2005-2010, 2015-2020, 2020-2025.

expected output:

enter image description here


Solution

  • I don't understand the full logic if the provided example is supposed to match the data, but you can use pandas.cut to form bins, then cumsum to get the cumulated sum (remove this if you just want a simple sum):

    years = list(range(2000, 2030, 5))
    # [2000, 2005, 2010, 2015, 2020, 2025]
    labels = [f'{a}-{b}' for a,b in zip(years, years[1:])]
    # ['2000-2005', '2005-2010', '2010-2015', '2015-2020', '2020-2025']
    
    (df.assign(year=pd.cut(df['year'], bins=years, labels=labels))
       .groupby(['year', 'keywords'])['year'].count()
       .unstack()
       .plot.bar(stacked=True)
    )
    

    With the red line:

    years = list(range(2000, 2030, 5))
    # [2000, 2005, 2010, 2015, 2020, 2025]
    labels = [f'{a}-{b}' for a,b in zip(years, years[1:])]
    # ['2000-2005', '2005-2010', '2010-2015', '2015-2020', '2020-2025']
    
    df2 = (df
     .assign(year=pd.cut(df['year'], bins=years, labels=labels))
     .groupby(['year', 'keywords'])['year'].count()
     .unstack()
    )
    
    ax = df2.plot.bar(stacked=True)
    # adding arbitrary shift (0.1)
    df2.sum(axis=1).add(0.1).plot(ax=ax, color='red', marker='s', label='paper count')
    ax.legend()
    

    output:

    enter image description here