Say I have the following data:
Date,release,count
2019-03-01,buster,0
2019-03-01,jessie,1
2019-03-01,stretch,74
2019-08-15,buster,25
2019-08-15,jessie,1
2019-08-15,stretch,49
2019-10-07,buster,35
2019-10-07,jessie,1
2019-10-07,stretch,43
2019-10-08,buster,40
2019-10-08,jessie,1
2019-10-08,stretch,38
2019-10-09,buster,46
2019-10-09,jessie,1
2019-10-09,stretch,33
2019-10-23,buster,46
2019-10-23,jessie,1
2019-10-23,stretch,31
2019-11-25,buster,46
2019-11-25,jessie,1
2019-11-25,stretch,29
2020-01-13,buster,48
2020-01-13,jessie,1
2020-01-13,stretch,28
2020-01-29,buster,50
2020-01-29,jessie,1
2020-01-29,stretch,26
2020-03-10,buster,54
2020-03-10,jessie,1
2020-03-10,stretch,22
2020-04-14,buster,55
2020-04-14,jessie,0
2020-04-14,stretch,21
2020-05-11,buster,57
2020-05-11,jessie,0
2020-05-11,stretch,17
2020-05-25,buster,61
2020-05-25,jessie,0
2020-05-25,stretch,14
2020-06-10,buster,62
2020-06-10,stretch,12
2020-07-01,buster,69
2020-07-01,stretch,3
2020-10-30,buster,74
2020-10-30,stretch,2
2020-11-18,buster,76
2020-11-18,stretch,2
2021-08-26,bullseye,1
2021-08-26,buster,86
2021-08-26,stretch,1
2021-10-08,bullseye,4
2021-10-08,buster,86
2021-10-08,stretch,1
2021-11-11,bullseye,4
2021-11-11,buster,84
2021-11-11,stretch,1
2021-11-17,bullseye,4
2021-11-17,buster,85
2021-11-17,stretch,0
And the following code:
import pandas as pd
import matplotlib.pyplot as plt
# Load the data
df = pd.read_csv('subset.csv')
# Pivot the data to a suitable format for plotting
df = df.pivot_table(index="Date", columns='release', values='count', aggfunc='sum')
# Convert the index to datetime (the pivoted index is already in date order)
df.index = pd.to_datetime(df.index)
print(df)
# Plotting the data with filled areas
fig, ax = plt.subplots(figsize=(12, 6))
df.plot(ax=ax, kind="area", stacked=True)
plt.show()
It generates the following graph:
In the above graph, the jessie line should have stopped after 2020-05-25, in the middle of the graph. But it just keeps going, a little energizer bunny of a line, all the way to the right of the graph, even though it's actually NaN. In the print(df) output, we can see this is the underlying dataframe after the pivot:
release     bullseye  buster  jessie  stretch
Date
2019-03-01       NaN     0.0     1.0     74.0
2019-08-15       NaN    25.0     1.0     49.0
2019-10-07       NaN    35.0     1.0     43.0
2019-10-08       NaN    40.0     1.0     38.0
2019-10-09       NaN    46.0     1.0     33.0
2019-10-23       NaN    46.0     1.0     31.0
2019-11-25       NaN    46.0     1.0     29.0
2020-01-13       NaN    48.0     1.0     28.0
2020-01-29       NaN    50.0     1.0     26.0
2020-03-10       NaN    54.0     1.0     22.0
2020-04-14       NaN    55.0     0.0     21.0
2020-05-11       NaN    57.0     0.0     17.0
2020-05-25       NaN    61.0     0.0     14.0
2020-06-10       NaN    62.0     NaN     12.0
2020-07-01       NaN    69.0     NaN      3.0
2020-10-30       NaN    74.0     NaN      2.0
2020-11-18       NaN    76.0     NaN      2.0
2021-08-26       1.0    86.0     NaN      1.0
2021-10-08       4.0    86.0     NaN      1.0
2021-11-11       4.0    84.0     NaN      1.0
2021-11-17       4.0    85.0     NaN      0.0
Interestingly, if you look closely, you can also see that the "bullseye" (blue) line is present from the very beginning of the graph as well.
So, what's going on? Is matplotlib or pandas or something in there plotting NaN as "zero" instead of "not in this graph"?
And dropna is not the answer here: it drops entire rows or columns, whereas I would need to drop individual cells, which makes no sense here.
Note that my previous iteration of this graph, using bars, doesn't have that issue:
Simply replace area with bar in the above to reproduce. The problem with the bar graph is it doesn't respect the scale of the X axis (time).
You should set the line width to zero:
ax = plt.subplot()
df.plot(ax=ax, kind='area', lw=0, stacked=True)
Output:
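A possible explanation for why this works, offered as an assumption rather than something verified against the pandas source here: the stacked area plot appears to treat NaN as 0 before stacking, so a missing release becomes a zero-height band whose outline is still drawn. The sketch below imitates that with fillna(0) and cumsum on three rows copied from the question's pivoted frame; the names filled, tops and bottoms are illustrative only.

```python
import numpy as np
import pandas as pd

# Three rows from the pivoted frame in the question (NaN = release absent).
df = pd.DataFrame(
    {
        "bullseye": [np.nan, np.nan, 1.0],
        "buster": [61.0, 62.0, 86.0],
        "jessie": [0.0, np.nan, np.nan],
        "stretch": [14.0, 12.0, 1.0],
    },
    index=pd.to_datetime(["2020-05-25", "2020-06-10", "2021-08-26"]),
)

# Assumption: this mirrors what the stacked area plot does with NaN.
filled = df.fillna(0)
tops = filled.cumsum(axis=1)   # upper edge of each stacked band
bottoms = tops - filled        # lower edge of each stacked band

# Height of the jessie band on the dates where the source value was NaN.
missing = df["jessie"].isna()
jessie_height = (tops["jessie"] - bottoms["jessie"])[missing]
print(jessie_height)
```

On those dates the jessie band has zero height and its upper edge coincides with buster's, so the only thing left to see is the outline, which lw=0 hides.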
The same issue also happens with a line plot, in which case the above would evidently not be a solution.
In this case, one can compute the stacked data with cumsum:
df.cumsum(axis=1).plot()
Output:
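The reason the cumsum approach keeps the gaps: with its default skipna=True, cumsum skips NaN while accumulating but leaves NaN in those positions of the result, so a stacked line has no points on dates where its release has no data. A minimal sketch, again on three rows copied from the question's pivoted frame (rebuilt inline for illustration):

```python
import numpy as np
import pandas as pd

# Three rows from the pivoted frame in the question (NaN = release absent).
df = pd.DataFrame(
    {
        "bullseye": [np.nan, np.nan, 1.0],
        "buster": [61.0, 62.0, 86.0],
        "jessie": [0.0, np.nan, np.nan],
        "stretch": [14.0, 12.0, 1.0],
    },
    index=pd.to_datetime(["2020-05-25", "2020-06-10", "2021-08-26"]),
)

# Row-wise running total: each column becomes the top of its stacked layer.
# NaN cells stay NaN, and the running total simply carries past them.
stacked = df.cumsum(axis=1)
print(stacked)
```

On 2020-06-10, for example, jessie stays NaN while stretch still reaches 62 + 12 = 74, so the remaining lines are stacked correctly and the jessie line ends where its data ends.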