I'm trying to extract a stacked bar chart over periodic time (5 years):
import pandas as pd
categorical = ["RL","CD(others)","DL","ML","ML","ML","DL","ML","DL","DL"]
year = [2014,2014,2015,2015,2016,2017,2019,2021,2022,2022]
df = pd.DataFrame({'year':year,
'keywords':categorical})
df
I tried relevant post1, post2, post3 to resolve the problem:
#solution1:Pivot table
df.pivot_table(index='year',
columns='keywords',
# values='paper_count',
aggfunc='sum')
#df.plot(x='year', y='paper_count', kind='bar')
#solution2: groupby
# reset_index() gives a column for counting after groupby uses year and category
ctdf = (df.reset_index()
.groupby(['year'], as_index=False)
.count()
# rename isn't strictly necessary here; it's just for readability
.rename(columns={'index':'paper_count'})
)
ctdf.plot(x='year', y='paper_count', kind='bar')
At the end, I couldn't figure out how can plot this periodically by counting every 5 yrs:
2000-2005
, 2005-2010
, 2015-2020
, 2020-2025
.
expected output:
I don't understand the full logic if the provided example is supposed to match the data, but you can use pandas.cut
to form bins, then cumsum
to get the cumulated sum (remove this if you just want a simple sum):
years = list(range(2000, 2030, 5))
# [2000, 2005, 2010, 2015, 2020, 2025]
labels = [f'{a}-{b}' for a,b in zip(years, years[1:])]
# ['2000-2005', '2005-2010', '2010-2015', '2015-2020', '2020-2025']
(df.assign(year=pd.cut(df['year'], bins=years, labels=labels))
.groupby(['year', 'keywords'])['year'].count()
.unstack()
.plot.bar(stacked=True)
)
With the red line:
years = list(range(2000, 2030, 5))
# [2000, 2005, 2010, 2015, 2020, 2025]
labels = [f'{a}-{b}' for a,b in zip(years, years[1:])]
# ['2000-2005', '2005-2010', '2010-2015', '2015-2020', '2020-2025']
df2 = (df
.assign(year=pd.cut(df['year'], bins=years, labels=labels))
.groupby(['year', 'keywords'])['year'].count()
.unstack()
)
ax = df2.plot.bar(stacked=True)
# adding arbitrary shift (0.1)
df2.sum(axis=1).add(0.1).plot(ax=ax, color='red', marker='s', label='paper count')
ax.legend()
output: