python pandas matplotlib nan

pandas fails to hide NaN entries from stacked line graphs


Say I have the following data:

Date,release,count
2019-03-01,buster,0
2019-03-01,jessie,1
2019-03-01,stretch,74
2019-08-15,buster,25
2019-08-15,jessie,1
2019-08-15,stretch,49
2019-10-07,buster,35
2019-10-07,jessie,1
2019-10-07,stretch,43
2019-10-08,buster,40
2019-10-08,jessie,1
2019-10-08,stretch,38
2019-10-09,buster,46
2019-10-09,jessie,1
2019-10-09,stretch,33
2019-10-23,buster,46
2019-10-23,jessie,1
2019-10-23,stretch,31
2019-11-25,buster,46
2019-11-25,jessie,1
2019-11-25,stretch,29
2020-01-13,buster,48
2020-01-13,jessie,1
2020-01-13,stretch,28
2020-01-29,buster,50
2020-01-29,jessie,1
2020-01-29,stretch,26
2020-03-10,buster,54
2020-03-10,jessie,1
2020-03-10,stretch,22
2020-04-14,buster,55
2020-04-14,jessie,0
2020-04-14,stretch,21
2020-05-11,buster,57
2020-05-11,jessie,0
2020-05-11,stretch,17
2020-05-25,buster,61
2020-05-25,jessie,0
2020-05-25,stretch,14
2020-06-10,buster,62
2020-06-10,stretch,12
2020-07-01,buster,69
2020-07-01,stretch,3
2020-10-30,buster,74
2020-10-30,stretch,2
2020-11-18,buster,76
2020-11-18,stretch,2
2021-08-26,bullseye,1
2021-08-26,buster,86
2021-08-26,stretch,1
2021-10-08,bullseye,4
2021-10-08,buster,86
2021-10-08,stretch,1
2021-11-11,bullseye,4
2021-11-11,buster,84
2021-11-11,stretch,1
2021-11-17,bullseye,4
2021-11-17,buster,85
2021-11-17,stretch,0

And the following code:

import pandas as pd
import matplotlib.pyplot as plt

# Load the data
df = pd.read_csv('subset.csv')

# Pivot the data to a suitable format for plotting
df = df.pivot_table(index="Date", columns='release', values='count', aggfunc='sum')

# Convert the index to datetime and sort it
df.index = pd.to_datetime(df.index)
df = df.sort_index()

print(df)

# Plotting the data with filled areas
fig, ax = plt.subplots(figsize=(12, 6))
df.plot(ax=ax, kind="area", stacked=True)

plt.show()

It generates the following graph:

[Image: stacked area graph produced by the code above]

In the above graph, the jessie line should have stopped after 2020-05-25, in the middle of the graph. But it just keeps going, a little energizer bunny of a line, all the way to the right of the graph, even though it's actually NaN. In the print(df) output, we can see this is the underlying dataframe after the pivot:

release     bullseye  buster  jessie  stretch
Date                                         
2019-03-01       NaN     0.0     1.0     74.0
2019-08-15       NaN    25.0     1.0     49.0
2019-10-07       NaN    35.0     1.0     43.0
2019-10-08       NaN    40.0     1.0     38.0
2019-10-09       NaN    46.0     1.0     33.0
2019-10-23       NaN    46.0     1.0     31.0
2019-11-25       NaN    46.0     1.0     29.0
2020-01-13       NaN    48.0     1.0     28.0
2020-01-29       NaN    50.0     1.0     26.0
2020-03-10       NaN    54.0     1.0     22.0
2020-04-14       NaN    55.0     0.0     21.0
2020-05-11       NaN    57.0     0.0     17.0
2020-05-25       NaN    61.0     0.0     14.0
2020-06-10       NaN    62.0     NaN     12.0
2020-07-01       NaN    69.0     NaN      3.0
2020-10-30       NaN    74.0     NaN      2.0
2020-11-18       NaN    76.0     NaN      2.0
2021-08-26       1.0    86.0     NaN      1.0
2021-10-08       4.0    86.0     NaN      1.0
2021-11-11       4.0    84.0     NaN      1.0
2021-11-17       4.0    85.0     NaN      0.0

Interestingly, if you look closely, you can also see that the "bullseye" (blue) line is present from the very beginning of the graph as well, even though it is NaN until 2021-08-26.

So, what's going on? Is matplotlib or pandas or something in there plotting NaN as "zero" instead of "not in this graph"?

And dropna is not the answer here: it drops entire rows or columns, whereas I would need to drop individual cells, which makes no sense here.
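To illustrate the dropna point, here is a minimal sketch on a three-row stand-in for the pivoted frame (values copied from the printed table above; the names `rows_kept` and `cols_kept` are only illustrative):

```python
import numpy as np
import pandas as pd

# Three rows taken from the pivoted frame shown above
df = pd.DataFrame(
    {
        "bullseye": [np.nan, np.nan, 1.0],
        "buster": [61.0, 62.0, 86.0],
        "jessie": [0.0, np.nan, np.nan],
        "stretch": [14.0, 12.0, 1.0],
    },
    index=pd.to_datetime(["2020-05-25", "2020-06-10", "2021-08-26"]),
)

rows_kept = df.dropna()        # drops every row that contains a NaN
cols_kept = df.dropna(axis=1)  # drops every column that contains a NaN

print(rows_kept.shape)
print(list(cols_kept.columns))
```

Since every row has a NaN in either bullseye or jessie, the row-wise variant leaves nothing to plot, and the column-wise variant throws away bullseye and jessie entirely.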

Note that my previous iteration of this graph, using bars, doesn't have that issue:

[Image: stacked bar graph of the same data]

Simply replace area with bar in the above to reproduce. The problem with the bar graph is it doesn't respect the scale of the X axis (time).


Solution

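  • When stacking, pandas replaces NaN with 0 before plotting, so those regions have zero height and only their outline remains visible. You should set the line width to zero to hide it: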

    ax = plt.subplot()
    df.plot(ax=ax, kind='area', lw=0, stacked=True)
    

    Output:

    pandas stacked area plot with NaN

    with a line plot

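    The same issue also happens with a stacked line plot, in which case the above would evidently not be a solution, since setting the line width to zero would hide every line.

    In this case, one can compute the stacked data with cumsum, which by default skips NaN when summing but leaves the NaN cells themselves as NaN, so those line segments are not drawn: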

    df.cumsum(axis=1).plot()
    

    Output:

    pandas stacked line plot with NaN
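As a rough check of the cumsum approach, the sketch below runs it on a three-row stand-in for the pivoted frame (values copied from the question's table; the name `stacked` is only illustrative) and prints the result, so the surviving NaN cells can be seen directly:

```python
import numpy as np
import pandas as pd

# Three rows taken from the pivoted frame in the question
df = pd.DataFrame(
    {
        "bullseye": [np.nan, np.nan, 1.0],
        "buster": [61.0, 62.0, 86.0],
        "jessie": [0.0, np.nan, np.nan],
        "stretch": [14.0, 12.0, 1.0],
    },
    index=pd.to_datetime(["2020-05-25", "2020-06-10", "2021-08-26"]),
)

# Row-wise running total: each column becomes the top edge of its band.
# NaN cells are skipped in the sum but stay NaN in the result.
stacked = df.cumsum(axis=1)
print(stacked)
```

Calling `stacked.plot()` on this frame should then draw each release only over the dates where it has data, because matplotlib does not draw line segments through NaN points.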