Tags: python, matplotlib

pcolormesh and memory use after adding to plot


If I have code like this:

import matplotlib.pyplot as plt
import numpy as np

time_axis = np.arange(0, 30 + 1 / 50, 1 / 50)
another_axis = np.arange(0, 10, 1)

grid = np.zeros((len(another_axis), len(time_axis)))
fig, ax = plt.subplots(12, 1, figsize=(10, 15), constrained_layout=True)

for i in range(12):
    grid = np.random.random((len(another_axis), len(time_axis)))
    pcm = ax[i].pcolormesh(time_axis, another_axis, grid, vmin=0, vmax=1)
    fig.colorbar(pcm, ax=ax[i])

plt.show()

I have a problem when not all 12 grids can fit into memory at once. At the point that plt.show() is called, is matplotlib holding memory proportional to 12*grid.size? Is there anything I can do to reduce this or is this just the way plotting and pcolormesh work?


Solution

  • Summary

    At the point that plt.show() is called, is matplotlib holding memory proportional to 12*grid.size?

    Based on the small benchmark below: Yes.

    Is there anything I can do to reduce this or is this just the way plotting and pcolormesh work?

    No, it can't be reduced; this is the way plotting works. See below for some optimizations of rendering time, which mainly come down to reducing the number of lines or points, i.e. "decluttering". Note that some web-based solutions do apply Ramer-Douglas-Peucker line simplification depending on zoom level (a short Python sketch of the same idea follows the examples below).

    e.g. leaflet https://leafletjs.com/reference.html#lineutil-simplify

    Dramatically reduces the number of points in a polyline while retaining its shape and returns a new array of simplified points, using the Ramer-Douglas-Peucker algorithm. Used for a huge performance boost when processing/displaying Leaflet polylines for each zoom level and also reducing visual noise. tolerance affects the amount of simplification (lesser value means higher quality but slower and with more points). Also released as a separated micro-library Simplify.js.

    (see also https://leafletjs.com/reference.html#polyline-smoothfactor, https://leafletjs.com/reference.html#lineutil-clipsegment, https://leafletjs.com/reference.html#polyutil-clippolygon)

    e.g. folium: Is there a limit to plotting markers with folium?
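
    The same idea is also available in Python, e.g. through shapely (not mentioned in the links above; just one library I know that implements it), whose simplify() uses Douglas-Peucker when preserve_topology=False. A minimal sketch, with an arbitrary tolerance value:

    import numpy as np
    from shapely.geometry import LineString
    
    # a noisy polyline with many points
    x = np.linspace(0, 30, 10_000)
    y = np.sin(x) + 0.01 * np.random.random(10_000)
    line = LineString(np.column_stack([x, y]))
    
    # Douglas-Peucker simplification: far fewer points, same overall shape
    simplified = line.simplify(tolerance=0.05, preserve_topology=False)
    print(len(line.coords), "->", len(simplified.coords))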

    As for your other request:

    It would be nice if the plot could be partly rendered to save space or something along these lines.

    Depending on the data (I assume this is not just random noise), you can consider contourf instead of pcolormesh, as seen here: https://matplotlib.org/stable/gallery/images_contours_and_fields/pcolormesh_levels.html#pcolormesh

    [figure: pcolormesh vs. contourf comparison, from the linked matplotlib gallery example]

    I did not test the performance of either, but a benchmark would be fairly doable based on the matplotlib example: just add more points.
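
    As an untested sketch (whether this actually saves memory depends on how smooth your data is; contourf stores contour polygons rather than one quad per cell):

    import matplotlib.pyplot as plt
    import numpy as np
    
    time_axis = np.arange(0, 30 + 1 / 50, 1 / 50)
    another_axis = np.arange(0, 10, 1)
    grid = np.random.random((len(another_axis), len(time_axis)))
    
    fig, ax = plt.subplots()
    # a limited number of levels; smooth data with large uniform
    # regions may produce a much lighter artist than a QuadMesh
    cf = ax.contourf(time_axis, another_axis, grid, levels=np.linspace(0, 1, 11))
    fig.colorbar(cf, ax=ax)
    plt.show()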

    Otherwise, you can save each subplot as its own figure to the file system, but you will obviously lose the interactivity.
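
    A minimal sketch of that idea, keeping only one grid in memory at a time (file names, dpi and figure size are arbitrary):

    import matplotlib.pyplot as plt
    import numpy as np
    
    time_axis = np.arange(0, 30 + 1 / 50, 1 / 50)
    another_axis = np.arange(0, 10, 1)
    
    for i in range(12):
        grid = np.random.random((len(another_axis), len(time_axis)))
        fig, ax = plt.subplots(figsize=(10, 1.5))
        pcm = ax.pcolormesh(time_axis, another_axis, grid, vmin=0, vmax=1)
        fig.colorbar(pcm, ax=ax)
        fig.savefig(f"subplot_{i:02d}.png", dpi=150)
        plt.close(fig)  # release the figure (and the grid it references) before the next iteration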

    Another approach would be to perform some analytics first, such as distribution visualisation (kde, box-and-whisker, ...), regression, fitting CDF/PDF laws, clustering, ... all of which could programmatically help you narrow things down to the relevant data. See the strategy proposed in this answer (see Tools benchmark below).
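
    For instance, a minimal sketch of the distribution idea, reducing each full grid to a single KDE curve (purely illustrative; whether this is meaningful depends on your data):

    import matplotlib.pyplot as plt
    import numpy as np
    import seaborn as sns
    
    time_axis = np.arange(0, 30 + 1 / 50, 1 / 50)
    another_axis = np.arange(0, 10, 1)
    
    fig, ax = plt.subplots()
    for i in range(12):
        grid = np.random.random((len(another_axis), len(time_axis)))
        # one 1-D density estimate per grid instead of a full mesh
        sns.kdeplot(x=grid.ravel(), ax=ax, label=f"grid {i}")
    ax.legend()
    plt.show()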

    BENCHMARK

    I did a small benchmark, using a Python object size estimator based on this SO thread. Unsurprisingly, matplotlib's structures grow linearly with the amount of added data.

    Fig size in memory increases linearly with the number of time steps in your data and the number of subplots.

    [figure: figure size in bytes vs. number of time steps, one line per subplot count, "initial" vs. "final"]

    import itertools
    import sys
    from collections import deque
    from collections.abc import Mapping, Set
    from gc import get_referents
    from numbers import Number
    from pathlib import Path
    from types import FunctionType, ModuleType
    
    import numpy as np
    import pandas as pd
    import seaborn as sns
    
    print(sys.version)
    import tqdm as __tqdm
    import matplotlib as mpl
    
    for _m in [np, pd, __tqdm, mpl, sns, ]:
        print(f"{_m.__name__:>14s} - {_m.__version__:12s}")
    
    import matplotlib.pyplot as plt
    
    from tqdm.auto import tqdm
    
    # Custom objects know their class.
    # Function objects seem to know way too much, including modules.
    # Exclude modules as well.
    BLACKLIST = type, ModuleType, FunctionType
    
    
    def getsize_v1(obj):
        """
        sum size of object & members.
        https://stackoverflow.com/a/30316760/7237062
        first function
        """
        if isinstance(obj, BLACKLIST):
            raise TypeError('getsize() does not take argument of type: ' + str(type(obj)))
        seen_ids = set()
        size = 0
        objects = [obj]
        while objects:
            need_referents = []
            for obj in objects:
                if not isinstance(obj, BLACKLIST) and id(obj) not in seen_ids:
                    seen_ids.add(id(obj))
                    size += sys.getsizeof(obj)
                    need_referents.append(obj)
            objects = get_referents(*need_referents)
        return size
    
    
    ZERO_DEPTH_BASES = (str, bytes, Number, range, bytearray)
    
    
    def getsize_custom(obj_0):
        """
        Recursively iterate to sum size of object & members.
        https://stackoverflow.com/a/30316760/7237062
        2nd implementation
        """
        _seen_ids = set()
    
        def inner(obj):
            obj_id = id(obj)
            if obj_id in _seen_ids:
                return 0
            _seen_ids.add(obj_id)
            size = sys.getsizeof(obj)
            if isinstance(obj, ZERO_DEPTH_BASES):
                pass  # bypass remaining control flow and return
            elif isinstance(obj, (tuple, list, Set, deque)):
                size += sum(inner(i) for i in obj)
            elif isinstance(obj, Mapping) or hasattr(obj, 'items'):
                size += sum(inner(k) + inner(v) for k, v in getattr(obj, 'items')())
            # Check for custom object instances - may subclass above too
            if hasattr(obj, '__dict__'):
                size += inner(vars(obj))
            if hasattr(obj, '__slots__'):  # can have __slots__ with __dict__
                size += sum(inner(getattr(obj, s)) for s in obj.__slots__ if hasattr(obj, s))
            return size
    
        return inner(obj_0)
    
    
    # don't let matplotlib try things related to rendering (ex: if using an interactive console)
    mpl.rcParams["interactive"] = False
    
    # takes time, so save data and analyze later
    root_dir = Path.home() / "StackOverflow" / "Q_78714345"
    root_dir.mkdir(parents=True, exist_ok=True)
    output_file_path_fig = (root_dir / "results_fig.csv").expanduser().absolute()
    output_file_path_subplots = (root_dir / "results_subplots.csv").expanduser().absolute()
    print(output_file_path_fig.as_uri())  # I have very bad memory
    print(output_file_path_subplots.as_uri())  # I have very bad memory
    
    # benchmark fig and subplot memory inflation
    print("BENCHMARK".center(50, " ").center(80, "*", ))
    subplots_results = []
    fig_results = []
    # nb_subplots = 12
    subplots = [2, 4, 8, 16, 32]
    data_points = [10, 50, 100, 500, 1_000, 5_000, 10_000, 25_000]  # can't go further, low-perf computer at hand ...
    combinatorics = list(itertools.product(subplots, data_points))
    
    another_axis = np.arange(0, 10, 1)
    for nb_subplots, time_steps in tqdm(combinatorics, desc="Generating test cases"):
        time_axis = np.arange(0, 30 + 1 / time_steps, 1 / time_steps)
        grid = np.zeros((len(another_axis), len(time_axis)))
        # print(f"Grid shape {grid.shape}".center(50, " ").center(80, "*", ))
        fig, ax = plt.subplots(nrows=nb_subplots, ncols=1, )  # figsize=(10, 15), constrained_layout=True)
        initial_fig_size = getsize_custom(fig)
        for i in range(nb_subplots):
            _ax = ax[i]
            initial = getsize_custom(_ax)
            grid = np.random.random((len(another_axis), len(time_axis)))
            pcm = _ax.pcolormesh(time_axis, another_axis, grid, vmin=0, vmax=1)
            fig.colorbar(pcm, ax=_ax)
            final = getsize_custom(_ax)
            # subplots_results.append((time_steps, nb_subplots, i, initial, "initial"))
            # subplots_results.append((time_steps, nb_subplots, i, final, "final"))
            # subplots_results.append((time_steps, nb_subplots, i, final / initial, "ratio"))
        final_fig_size = getsize_custom(fig)
        fig_results.append((time_steps, nb_subplots, initial_fig_size, "initial"))
        fig_results.append((time_steps, nb_subplots, final_fig_size, "final"))
        fig_results.append((time_steps, nb_subplots, final_fig_size / initial_fig_size, "ratio"))
        # plt.show() # don't !
        plt.clf()
        plt.close()
        del fig
        del ax
    
    # df_subplots = pd.DataFrame(
    #        subplots_results,
    #        columns=["Time steps", "nb_subplots", "subplot nb", "size", "Moment"]
    #        )
    df_figure = pd.DataFrame(
            fig_results,
            columns=["Time steps", "nb_subplots", "Fig size", "Moment"]
            )
    
    print("SAVE DATAFRAMES".center(50, " ").center(80, "*", ))
    df_figure.to_csv(output_file_path_fig, sep=";", index=False)
    # df_subplots.to_csv(output_file_path_subplots, sep=";", index=False)
    
    tmp_ = df_figure[df_figure["Moment"] != "ratio"].copy()
    
    tmp_["Moment"] = tmp_["Moment"].astype("category")
    tmp_["nb_subplots"] = tmp_["nb_subplots"].astype("category")
    
    sns.set_theme(style="darkgrid")
    sns.set_context("paper")
    g = sns.relplot(data=tmp_, x="Time steps", y="Fig size", style="Moment", hue="nb_subplots", kind="line", markers=True,)
    g.set_ylabels("Size in bytes")
    plt.show()
    

    versions used:

    3.9.6 (tags/v3.9.6:db3ff76, Jun 28 2021, 15:26:21) [MSC v.1929 64 bit (AMD64)]
    numpy                - 2.0.0       
    pandas               - 2.2.2       
    tqdm                 - 4.66.4      
    matplotlib           - 3.9.1       
    seaborn              - 0.13.2   
    

    Tools benchmark

    You may want to read this SO Q/A

    matplotlib could not handle 10 million points; other tools did. See "Summary of results" there.

    Depending on the exact details of your problem, you may want to consider something else entirely. The above post contains an overview of several tools for large data.

    But I did not see anything there related to pcolormesh, even though I suspect this kind of plot is subject to the same results.

    Additional info concerning matplotlib (time) performance, and backends

    Quoting https://matplotlib.org/stable/users/explain/artists/performance.html

    Whether exploring data in interactive mode or programmatically saving lots of plots, rendering performance can be a challenging bottleneck in your pipeline. Matplotlib provides multiple ways to greatly reduce rendering time at the cost of a slight change (to a settable tolerance) in your plot's appearance. The methods available to reduce rendering time depend on the type of plot that is being created.

    They indicate that you can tune the display rendering performance with the tweaks below (briefly reproduced here; the link goes into the details of how and why).

    However, there is no paragraph about the in-memory strategy for the elements to plot, short of reducing the set of items to display yourself, which you do not seem to want to do.

    # simplify paths made of many line segments, to a settable tolerance
    mpl.rcParams['path.simplify'] = True
    mpl.rcParams['path.simplify_threshold'] = 0.0  # no simplification
    # or
    mpl.rcParams['path.simplify_threshold'] = 1.0  # maximum simplification
    
    # draw only every 10th marker (x, y: your data)
    plt.plot(x, y, markevery=10)
    
    # style preset that bundles the simplification and chunking settings
    import matplotlib.style as mplstyle
    mplstyle.use('fast')
    

    In addition, in this link the author details some memory leak situations with matplotlib and how to deal with them.

    It boils down to:

    # plot figures
    # ...
    # properly clean and close
    plt.clf()
    plt.close()
    

    But again, this works only if matplotlib does not explode at the plt.show() step.


    On the front-end side of matplotlib, there isn't much you can do to have an interactive plot with lots of data.

    On the backend side, you may have other options, but I can't tell which ones work.

    See this page of Matplotlib documentation on backends

    There are three ways to configure your backend:

    • The rcParams["backend"] parameter in your matplotlibrc file
    • The MPLBACKEND environment variable
    • The function matplotlib.use()
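
    For example, a minimal sketch of the last two options (Agg is picked here only as an illustration; any installed backend name works):

    # Option 2: environment variable, set before Python starts
    #   MPLBACKEND=Agg python my_script.py
    
    # Option 3: call matplotlib.use() before the first plot is created
    import matplotlib
    matplotlib.use("Agg")  # non-interactive raster backend
    
    import matplotlib.pyplot as plt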

    It then goes into great depth:

    To make things easily more customizable for graphical user interfaces, Matplotlib separates the concept of the renderer (the thing that actually does the drawing) from the canvas (the place where the drawing goes). The canonical renderer for user interfaces is Agg which uses the Anti-Grain Geometry C++ library to make a raster (pixel) image of the figure; it is used by the QtAgg, GTK4Agg, GTK3Agg, wxAgg, TkAgg, and macosx backends. An alternative renderer is based on the Cairo library, used by QtCairo, etc.

    I suppose that many rendering optimizations occur within Agg. If RAM explodes there (so not earlier, in the frontend part ...), there is not much you can do; I bet the C++ code is already fairly well optimized.
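
    To make the renderer/canvas separation concrete, here is a minimal sketch driving the Agg canvas directly, without pyplot:

    from matplotlib.backends.backend_agg import FigureCanvasAgg
    from matplotlib.figure import Figure
    
    fig = Figure()
    canvas = FigureCanvasAgg(fig)  # the canvas owns the Agg renderer
    ax = fig.add_subplot()
    ax.plot([0, 1, 2], [0, 1, 4])
    canvas.draw()                  # rasterize the figure in memory
    rgba = canvas.buffer_rgba()    # pixel buffer produced by Agg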

    For the rendering engines, users can also distinguish between vector or raster renderers. Vector graphics languages issue drawing commands like "draw a line from this point to this point" and hence are scale free. Raster backends generate a pixel representation of the line whose accuracy depends on a DPI setting.

    There, you can find a list of alternative interactive backends. Maybe try some vector ones?
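
    One caveat if you go that way: with pcolormesh, a vector output stores every quad of the mesh, so the files (and the memory needed to write them) can get very large. A quick way to test, as a sketch (the "svg" backend is non-interactive; the output path is arbitrary):

    import matplotlib
    matplotlib.use("svg")  # non-interactive vector backend
    import matplotlib.pyplot as plt
    import numpy as np
    
    time_axis = np.arange(0, 30 + 1 / 50, 1 / 50)
    another_axis = np.arange(0, 10, 1)
    grid = np.random.random((len(another_axis), len(time_axis)))
    
    fig, ax = plt.subplots()
    ax.pcolormesh(time_axis, another_axis, grid, vmin=0, vmax=1)
    fig.savefig("plot.svg")  # each mesh cell becomes a vector element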