Tags: python, matplotlib

pcolormesh and memory use after adding to plot


If I have code like this:

import matplotlib.pyplot as plt
import numpy as np

time_axis = np.arange(0, 30 + 1 / 50, 1 / 50)
another_axis = np.arange(0, 10, 1)

grid = np.zeros((len(another_axis), len(time_axis)))
fig, ax = plt.subplots(12, 1, figsize=(10, 15), constrained_layout=True)

for i in range(12):
    grid = np.random.random((len(another_axis), len(time_axis)))
    pcm = ax[i].pcolormesh(time_axis, another_axis, grid, vmin=0, vmax=1)
    fig.colorbar(pcm, ax=ax[i])

plt.show()

I have a problem when not all 12 grids can fit into memory at once. At the point that plt.show() is called, is matplotlib holding memory proportional to 12*grid.size? Is there anything I can do to reduce this or is this just the way plotting and pcolormesh work?


Solution

  • Summary

    At the point that plt.show() is called, is matplotlib holding memory proportional to 12*grid.size?

    Based on the small benchmark below: Yes.

    Is there anything I can do to reduce this or is this just the way plotting and pcolormesh work?

    No, it can't be reduced; this is the way plotting works. See below for some optimizations of rendering time, which mainly come down to reducing the number of lines or points, i.e. "decluttering". Note that some web-based solutions do apply Ramer-Douglas-Peucker line simplification depending on zoom level (a short Python sketch of the same idea follows the examples below).

    e.g. leaflet https://leafletjs.com/reference.html#lineutil-simplify

    Dramatically reduces the number of points in a polyline while retaining its shape and returns a new array of simplified points, using the Ramer-Douglas-Peucker algorithm. Used for a huge performance boost when processing/displaying Leaflet polylines for each zoom level and also reducing visual noise. tolerance affects the amount of simplification (lesser value means higher quality but slower and with more points). Also released as a separated micro-library Simplify.js.

    (see also https://leafletjs.com/reference.html#polyline-smoothfactor, https://leafletjs.com/reference.html#lineutil-clipsegment, https://leafletjs.com/reference.html#polyutil-clippolygon)

    e.g. folium: Is there a limit to plotting markers with folium?
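
    The same idea is also available in Python, e.g. through shapely (not mentioned in the links above; just one library I know that implements it), whose simplify() uses Douglas-Peucker when preserve_topology=False. A minimal sketch, with an arbitrary tolerance value:

    import numpy as np
    from shapely.geometry import LineString
    
    # a noisy polyline with many points
    x = np.linspace(0, 30, 10_000)
    y = np.sin(x) + 0.01 * np.random.random(10_000)
    line = LineString(np.column_stack([x, y]))
    
    # Douglas-Peucker simplification: far fewer points, same overall shape
    simplified = line.simplify(tolerance=0.05, preserve_topology=False)
    print(len(line.coords), "->", len(simplified.coords))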

    As for your other request:

    It would be nice if the plot could be partly rendered to save space or something along these lines.

    Depending on the data (I assume this is not just random noise), you can consider contourf instead of pcolormesh, as seen here: https://matplotlib.org/stable/gallery/images_contours_and_fields/pcolormesh_levels.html#pcolormesh

    [figure: pcolormesh vs. contourf comparison, from the linked matplotlib gallery example]

    I did not test the performance of either, but a benchmark would be fairly doable based on the matplotlib example: just add more points.
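
    As an untested sketch (whether this actually saves memory depends on how smooth your data is; contourf stores contour polygons rather than one quad per cell):

    import matplotlib.pyplot as plt
    import numpy as np
    
    time_axis = np.arange(0, 30 + 1 / 50, 1 / 50)
    another_axis = np.arange(0, 10, 1)
    grid = np.random.random((len(another_axis), len(time_axis)))
    
    fig, ax = plt.subplots()
    # a limited number of levels; smooth data with large uniform
    # regions may produce a much lighter artist than a QuadMesh
    cf = ax.contourf(time_axis, another_axis, grid, levels=np.linspace(0, 1, 11))
    fig.colorbar(cf, ax=ax)
    plt.show()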

    Otherwise, you can save each subplot as its own figure to the file system, but you will obviously lose the interactivity.
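
    A minimal sketch of that idea, keeping only one grid in memory at a time (file names, dpi and figure size are arbitrary):

    import matplotlib.pyplot as plt
    import numpy as np
    
    time_axis = np.arange(0, 30 + 1 / 50, 1 / 50)
    another_axis = np.arange(0, 10, 1)
    
    for i in range(12):
        grid = np.random.random((len(another_axis), len(time_axis)))
        fig, ax = plt.subplots(figsize=(10, 1.5))
        pcm = ax.pcolormesh(time_axis, another_axis, grid, vmin=0, vmax=1)
        fig.colorbar(pcm, ax=ax)
        fig.savefig(f"subplot_{i:02d}.png", dpi=150)
        plt.close(fig)  # release the figure (and the grid it references) before the next iteration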

    Another approach would be to perform some analytics first, such as distribution visualisation (kde, box-and-whisker, ...), regression, fitting CDF/PDF laws, clustering, ... all of which could programmatically help you narrow things down to the relevant data. See the strategy proposed in this answer (see Tools benchmark below).
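
    For instance, a minimal sketch of the distribution idea, reducing each full grid to a single KDE curve (purely illustrative; whether this is meaningful depends on your data):

    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    
    time_axis = np.arange(0, 30 + 1 / 50, 1 / 50)
    another_axis = np.arange(0, 10, 1)
    
    fig, ax = plt.subplots()
    for i in range(12):
        grid = np.random.random((len(another_axis), len(time_axis)))
        # one 1-D density estimate per grid instead of a full mesh
        sns.kdeplot(x=grid.ravel(), ax=ax, label=f"grid {i}")
    ax.legend()
    plt.show()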

    BENCHMARK

    I did a small benchmark, using a Python object size estimator based on this SO thread. Unsurprisingly, matplotlib's structures grow linearly with the amount of added data.

    Fig size in memory increases linearly with the number of time steps in your data and the number of subplots.

    [figure: figure size in bytes vs. number of time steps, one line per subplot count, "initial" vs. "final"]

    import itertools
    import sys
    from collections import deque
    from collections.abc import Mapping, Set
    from gc import get_referents
    from numbers import Number
    from pathlib import Path
    from types import FunctionType, ModuleType
    
    import numpy as np
    import pandas as pd
    import seaborn as sns
    
    print(sys.version)
    import tqdm as __tqdm
    import matplotlib as mpl
    
    for _m in [np, pd, __tqdm, mpl, sns, ]:
        print(f"{_m.__name__:>14s} - {_m.__version__:12s}")
    
    import matplotlib.pyplot as plt
    
    from tqdm.auto import tqdm
    
    # Custom objects know their class.
    # Function objects seem to know way too much, including modules.
    # Exclude modules as well.
    BLACKLIST = type, ModuleType, FunctionType
    
    
    def getsize_v1(obj):
        """
        sum size of object & members.
        https://stackoverflow.com/a/30316760/7237062
        first function
        """
        if isinstance(obj, BLACKLIST):
            raise TypeError('getsize() does not take argument of type: ' + str(type(obj)))
        seen_ids = set()
        size = 0
        objects = [obj]
        while objects:
            need_referents = []
            for obj in objects:
                if not isinstance(obj, BLACKLIST) and id(obj) not in seen_ids:
                    seen_ids.add(id(obj))
                    size += sys.getsizeof(obj)
                    need_referents.append(obj)
            objects = get_referents(*need_referents)
        return size
    
    
    ZERO_DEPTH_BASES = (str, bytes, Number, range, bytearray)
    
    
    def getsize_custom(obj_0):
        """
        Recursively iterate to sum size of object & members.
        https://stackoverflow.com/a/30316760/7237062
        2nd implementation
        """
        _seen_ids = set()
    
        def inner(obj):
            obj_id = id(obj)
            if obj_id in _seen_ids:
                return 0
            _seen_ids.add(obj_id)
            size = sys.getsizeof(obj)
            if isinstance(obj, ZERO_DEPTH_BASES):
                pass  # bypass remaining control flow and return
            elif isinstance(obj, (tuple, list, Set, deque)):
                size += sum(inner(i) for i in obj)
            elif isinstance(obj, Mapping) or hasattr(obj, 'items'):
                size += sum(inner(k) + inner(v) for k, v in getattr(obj, 'items')())
            # Check for custom object instances - may subclass above too
            if hasattr(obj, '__dict__'):
                size += inner(vars(obj))
            if hasattr(obj, '__slots__'):  # can have __slots__ with __dict__
                size += sum(inner(getattr(obj, s)) for s in obj.__slots__ if hasattr(obj, s))
            return size
    
        return inner(obj_0)
    
    
    # don't let matplotlib try things related to rendering (ex: if using an interactive console)
    mpl.rcParams["interactive"] = False
    
    # takes time, so save data and analyze later
    root_dir = Path.home() / "StackOverflow" / "Q_78714345"
    root_dir.mkdir(parents=True, exist_ok=True)
    output_file_path_fig = (root_dir / "results_fig.csv").expanduser().absolute()
    output_file_path_subplots = (root_dir / "results_subplots.csv").expanduser().absolute()
    print(output_file_path_fig.as_uri())  # I have very bad memory
    print(output_file_path_subplots.as_uri())  # I have very bad memory
    
    # benchmark fig and subplot memory inflation
    print("BENCHMARK".center(50, " ").center(80, "*", ))
    subplots_results = []
    fig_results = []
    # nb_subplots = 12
    subplots = [2, 4, 8, 16, 32]
    data_points = [10, 50, 100, 500, 1_000, 5_000, 10_000, 25_000]  # can't go further, low-perf computer at hand ...
    combinatorics = list(itertools.product(subplots, data_points))
    
    another_axis = np.arange(0, 10, 1)
    for nb_subplots, time_steps in tqdm(combinatorics, desc="Generating test cases"):
        time_axis = np.arange(0, 30 + 1 / time_steps, 1 / time_steps)
        grid = np.zeros((len(another_axis), len(time_axis)))
        # print(f"Grid shape {grid.shape}".center(50, " ").center(80, "*", ))
        fig, ax = plt.subplots(nrows=nb_subplots, ncols=1, )  # figsize=(10, 15), constrained_layout=True)
        initial_fig_size = getsize_custom(fig)
        for i in range(nb_subplots):
            _ax = ax[i]
            initial = getsize_custom(_ax)
            grid = np.random.random((len(another_axis), len(time_axis)))
            pcm = _ax.pcolormesh(time_axis, another_axis, grid, vmin=0, vmax=1)
            fig.colorbar(pcm, ax=_ax)
            final = getsize_custom(_ax)
            # subplots_results.append((time_steps, nb_subplots, i, initial, "initial"))
            # subplots_results.append((time_steps, nb_subplots, i, final, "final"))
            # subplots_results.append((time_steps, nb_subplots, i, final / initial, "ratio"))
        final_fig_size = getsize_custom(fig)
        fig_results.append((time_steps, nb_subplots, initial_fig_size, "initial"))
        fig_results.append((time_steps, nb_subplots, final_fig_size, "final"))
        fig_results.append((time_steps, nb_subplots, final_fig_size / initial_fig_size, "ratio"))
        # plt.show() # don't !
        plt.clf()
        plt.close()
        del fig
        del ax
    
    # df_subplots = pd.DataFrame(
    #        subplots_results,
    #        columns=["Time steps", "nb_subplots", "subplot nb", "size", "Moment"]
    #        )
    df_figure = pd.DataFrame(
            fig_results,
            columns=["Time steps", "nb_subplots", "Fig size", "Moment"]
            )
    
    print("SAVE DATAFRAMES".center(50, " ").center(80, "*", ))
    df_figure.to_csv(output_file_path_fig, sep=";", index=False)
    # df_subplots.to_csv(output_file_path_subplots, sep=";", index=False)
    
    tmp_ = df_figure[df_figure["Moment"] != "ratio"].copy()
    
    tmp_["Moment"] = tmp_["Moment"].astype("category")
    tmp_["nb_subplots"] = tmp_["nb_subplots"].astype("category")
    
    sns.set_theme(style="darkgrid")
    sns.set_context("paper")
    g = sns.relplot(data=tmp_, x="Time steps", y="Fig size", style="Moment", hue="nb_subplots", kind="line", markers=True,)
    g.set_ylabels("Size in bytes")
    plt.show()
    

    versions used:

    3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)]
    numpy                - 2.0.0       
    pandas               - 2.2.2       
    tqdm                 - 4.66.4      
    matplotlib           - 3.9.1       
    seaborn              - 0.13.2   
    

    Tools benchmark

    You may want to read this SO Q/A

    matplotlib could not handle 10 million points; other tools did. See "Summary of results" there.

    Depending on the exact details of your problem, you may want to consider something else entirely. The above post contains an overview of several tools for large data.

    But I did not see anything there related to pcolormesh, even though I suspect this kind of plot is subject to the same results.

    Additional info concerning matplotlib (time) performance, and backends

    Quoting https://matplotlib.org/stable/users/explain/artists/performance.html

    Whether exploring data in interactive mode or programmatically saving lots of plots, rendering performance can be a challenging bottleneck in your pipeline. Matplotlib provides multiple ways to greatly reduce rendering time at the cost of a slight change (to a settable tolerance) in your plot's appearance. The methods available to reduce rendering time depend on the type of plot that is being created.

    They indicate that you can tune the display rendering performance with the tweaks below (briefly reproduced here; the link goes into the details of how and why).

    However, there is no paragraph about the in-memory strategy for the elements to plot, short of reducing the set of items to display yourself, which you do not seem to want to do.

    # simplify paths made of many line segments, to a settable tolerance
    mpl.rcParams['path.simplify'] = True
    mpl.rcParams['path.simplify_threshold'] = 0.0  # no simplification
    # or
    mpl.rcParams['path.simplify_threshold'] = 1.0  # maximum simplification
    
    # draw only every 10th marker (x, y: your data)
    plt.plot(x, y, markevery=10)
    
    # style preset that bundles the simplification and chunking settings
    import matplotlib.style as mplstyle
    mplstyle.use('fast')
    

    In addition, in this link the author details some memory leak situations with matplotlib and how to deal with them.

    It boils down to:

    # plot figures
    # ...
    # properly clean and close
    plt.clf()
    plt.close()
    

    But again, this works only if matplotlib does not explode at the plt.show() step.


    On the front-end side of matplotlib, there isn't much you can do to have an interactive plot with lots of data.

    On the backend side, you may have other options, but I can't tell which ones work.

    See this page of Matplotlib documentation on backends

    There are three ways to configure your backend:

    • The rcParams["backend"] parameter in your matplotlibrc file
    • The MPLBACKEND environment variable
    • The function matplotlib.use()
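
    For example, a minimal sketch of the last two options (Agg is picked here only as an illustration; any installed backend name works):

    # Option 2: environment variable, set before Python starts
    #   MPLBACKEND=Agg python my_script.py
    
    # Option 3: call matplotlib.use() before the first plot is created
    import matplotlib
    matplotlib.use("Agg")  # non-interactive raster backend
    
    import matplotlib.pyplot as plt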

    It then goes into great depth:

    To make things easily more customizable for graphical user interfaces, Matplotlib separates the concept of the renderer (the thing that actually does the drawing) from the canvas (the place where the drawing goes). The canonical renderer for user interfaces is Agg which uses the Anti-Grain Geometry C++ library to make a raster (pixel) image of the figure; it is used by the QtAgg, GTK4Agg, GTK3Agg, wxAgg, TkAgg, and macosx backends. An alternative renderer is based on the Cairo library, used by QtCairo, etc.

    I suppose that many rendering optimizations occur within Agg. If RAM explodes there (so not earlier, in the frontend part ...), there is not much you can do; I bet the C++ code is already fairly well optimized.
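
    To make the renderer/canvas separation concrete, here is a minimal sketch driving the Agg canvas directly, without pyplot:

    from matplotlib.backends.backend_agg import FigureCanvasAgg
    from matplotlib.figure import Figure
    
    fig = Figure()
    canvas = FigureCanvasAgg(fig)  # the canvas owns the Agg renderer
    ax = fig.add_subplot()
    ax.plot([0, 1, 2], [0, 1, 4])
    canvas.draw()                  # rasterize the figure in memory
    rgba = canvas.buffer_rgba()    # pixel buffer produced by Agg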

    For the rendering engines, users can also distinguish between vector or raster renderers. Vector graphics languages issue drawing commands like "draw a line from this point to this point" and hence are scale free. Raster backends generate a pixel representation of the line whose accuracy depends on a DPI setting.

    There, you can find a list of alternative interactive backends. Maybe try some vector ones?
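
    One caveat if you go that way: with pcolormesh, a vector output stores every quad of the mesh, so the files (and the memory needed to write them) can get very large. A quick way to test, as a sketch (the "svg" backend is non-interactive; the output path is arbitrary):

    import matplotlib
    matplotlib.use("svg")  # non-interactive vector backend
    import matplotlib.pyplot as plt
    import numpy as np
    
    time_axis = np.arange(0, 30 + 1 / 50, 1 / 50)
    another_axis = np.arange(0, 10, 1)
    grid = np.random.random((len(another_axis), len(time_axis)))
    
    fig, ax = plt.subplots()
    ax.pcolormesh(time_axis, another_axis, grid, vmin=0, vmax=1)
    fig.savefig("plot.svg")  # each mesh cell becomes a vector element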