
Switching to Python from MATLAB: plotting and RAM consumption


I'm in the process of switching from MATLAB to Python. I'm rewriting most of my code and am using the Spyder IDE.

I constantly need to load large datasets and plot them. Here I've encountered an issue with my use of Python and RAM consumption when plotting the data.

Here's an example of the two codes, first in MATLAB with the RAM usage at each point, and then the Python implementation. The goal is to read several HDF5 datasets, stitch them together and plot the result.

MATLAB:

filename = dir('*00.hdf5');
[~, idx]   = max([filename.datenum]);
filename = filename(idx).name;
dLim = [0 1];
Data = [];
lgth = cellstr(num2str((dLim(1):dLim(2))', '%04d'));
date = filename(1:17);
for i = 1:length(lgth)
    filename1 = [date ,sprintf('.00%s', lgth{i}),'.hdf5'];
    intD = double(h5read(filename1,'/data'));
    Data = [Data; intD];
end

Final size of Data = 58976577x4 RAM usage = 3662 MB

after plot with MATLAB

RAM usage = 7322 MB.

Python:

import glob
import os

import h5py
import numpy as np

filenames = glob.glob('*00.hdf5')
filename = max(filenames, key=os.path.getmtime)
filename = os.path.basename(filename)
dLim = [0, 1]
lgth = [f"{i:04d}" for i in range(dLim[0], dLim[1] + 1)]
date = filename[:17]
Data = None
for i in range(len(lgth)):
    filename1 = f"{date}.00{lgth[i]}.hdf5"
    with h5py.File(filename1, 'r') as f:
        intD = np.array(f['/data']).T
        if Data is None:
            Data = intD
        else:
            Data = np.vstack((Data, intD))

Final size of Data = 58976577x4

RAM usage = 2280 MB

after plotting with matplotlib:

from matplotlib import pyplot as plt

plt.plot(Data)

RAM usage = 11000+ MB.

Summary: MATLAB RAM consumption:

Final size of Data = 58976577x4

RAM usage = 3662 MB

After plot:

RAM usage = 7322 MB

Python (Spyder) RAM consumption:

Final size of Data = 58976577x4

RAM usage = 2280 MB

After plot with matplotlib:

RAM usage = 11000+ MB. Becomes barely responsive.

What am I doing wrong? This is not even the largest dataset that I have worked with before, and I'm stuck... I can downsample the data and reduce the load, but that's not the point. That can be done in MATLAB as well...


Solution

  • I'll break this down into different categories because there is a lot to talk about.

    Data Handling:

    MATLAB handles data more efficiently (in terms of memory) thanks to its internal optimizations. Python offers more flexibility, but unless you have a beefed-out computer, that flexibility comes at the cost of your precious RAM, especially with operations such as np.vstack, which allocates a brand-new array and copies all existing data into it on every call.
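One way to avoid those repeated copies is to preallocate the full array up front; h5py datasets expose their shape without loading any data, so the total row count can be computed before reading. A minimal sketch of the idea, using small stand-in arrays in place of the actual HDF5 reads:

```python
import numpy as np

# Stand-ins for the per-file reads; in the real code, these shapes would
# come from f['/data'].shape for each HDF5 file before loading anything.
chunks = [np.ones((3, 4)), np.full((2, 4), 2.0)]

# Preallocate once, then write each chunk in place -- no intermediate
# copies of the already-loaded data, unlike repeated np.vstack calls.
total_rows = sum(c.shape[0] for c in chunks)
Data = np.empty((total_rows, 4))

row = 0
for c in chunks:
    Data[row:row + c.shape[0]] = c
    row += c.shape[0]
```

With this pattern, peak memory stays at roughly one copy of the full dataset instead of two.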

    Memory usage during plotting

    Matplotlib stores additional data when plotting, such as internal per-vertex copies of whatever you pass to plt.plot. You can optimize memory usage in Python, but unlike MATLAB, matplotlib is not specialized out of the box for rendering very large arrays.
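If downsampling is acceptable for visual inspection, a simple stride slice shrinks what matplotlib has to copy, in proportion to the step. The slice itself is a view, so it costs no memory until plotting; the step of 100 below is an arbitrary choice:

```python
import numpy as np

# A stand-in array (the real Data is ~59 million rows x 4 columns).
rng = np.random.default_rng(0)
Data = rng.normal(size=(1_000_000, 4))

step = 100              # decimation factor -- arbitrary, tune to taste
small = Data[::step]    # a strided view: no copy until matplotlib makes one

# plt.plot(small) would then carry ~1/100th of the per-vertex overhead.
```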

    Looking around, I found an optimisation for Python (as shown below):

    data_list = []
    for i in range(len(lgth)):
        filename1 = f"{date}.00{lgth[i]}.hdf5"
        with h5py.File(filename1, 'r') as f:
            intD = np.array(f['/data']).T
            data_list.append(intD)
    Data = np.vstack(data_list)
    

    You can also utilise memory-profiling tools like memory_profiler or tracemalloc to understand where your memory is being consumed the most. They will pinpoint the specific lines/functions that are contributing to the high memory usage.
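For instance, the standard-library tracemalloc module can report current and peak Python-level allocations around a suspect block of code (the list below is just a stand-in for a memory-hungry step):

```python
import tracemalloc

tracemalloc.start()

# Stand-in allocation (in the real code: loading and stitching the data).
data = [float(i) for i in range(200_000)]

current, peak = tracemalloc.get_traced_memory()  # both in bytes
tracemalloc.stop()

# tracemalloc.take_snapshot() with snapshot.statistics('lineno') would
# additionally break the usage down per source line.
```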

    However, Python is an object-oriented, general-purpose language: it is easy to read and understand, which makes development easier. For real-time work Python isn't the best choice out of the box; it works well if you don't mind its unoptimized defaults.