I'm in the process of switching from MATLAB to Python. I'm rewriting most of my code and using the Spyder IDE.
I constantly need to load large datasets and plot them, and I've hit an issue with RAM consumption in Python when plotting the data.
Here's an example of two scripts, first in MATLAB with the RAM usage at each point, and then the implementation in Python. The goal is to read several HDF5 datasets, stitch them together, and plot.
MATLAB:
filename = dir('*00.hdf5');
[~, idx] = max([filename.datenum]);
filename = filename(idx).name;
dLim = [0 1];
Data = [];
lgth = cellstr(num2str((dLim(1):dLim(2))', '%04d'));
date = filename(1:17);
for i = 1:length(lgth)
    filename1 = [date, sprintf('.00%s', lgth{i}), '.hdf5'];
    intD = double(h5read(filename1, '/data'));
    Data = [Data; intD];
end
Final size of Data = 58976577x4
RAM usage = 3662 MB
After plotting with MATLAB:
RAM usage = 7322 MB
Python:
import glob
import os

import h5py
import numpy as np

filenames = glob.glob('*00.hdf5')
filename = max(filenames, key=os.path.getmtime)
filename = os.path.basename(filename)
dLim = [0, 1]
lgth = [f"{i:04d}" for i in range(dLim[0], dLim[1] + 1)]
date = filename[:17]
Data = None
for i in range(len(lgth)):
    filename1 = f"{date}.00{lgth[i]}.hdf5"
    with h5py.File(filename1, 'r') as f:
        intD = np.array(f['/data']).T
    if Data is None:
        Data = intD
    else:
        Data = np.vstack((Data, intD))
Final size of Data = 58976577x4
RAM usage = 2280 MB
after plotting with matplotlib:
from matplotlib import pyplot as plt
plt.plot(Data)
RAM usage = 11000+ MB.
Summary: MATLAB RAM consumption:
Final size of Data = 58976577x4
RAM usage = 3662 MB
After plot:
RAM usage = 7322 MB
Python (Spyder) RAM consumption:
Final size of Data = 58976577x4
RAM usage = 2280 MB
After plot with matplotlib:
RAM usage = 11000+ MB. Becomes barely responsive.
What am I doing wrong? This is not even the largest dataset that I have worked with before, and I'm stuck... I can downsample the data and reduce the load, but that's not the point. That can be done in MATLAB as well...
I'll break this down into different categories because there is a lot to talk about.
Data Handling:
MATLAB handles this workload more memory-efficiently thanks to its internal optimizations. Python offers more flexibility, but unless you have a beefy machine that flexibility costs RAM, especially with operations like np.vstack: it always allocates a brand-new array and copies every input into it, so calling it inside a loop recopies all of the accumulated data on every iteration.
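A minimal sketch of that behaviour, with tiny synthetic arrays standing in for the real data: the result of np.vstack never shares memory with its inputs, so every call pays for a full copy.

```python
import numpy as np

a = np.zeros((2, 4))
b = np.ones((3, 4))

c = np.vstack((a, b))  # allocates a fresh 5x4 array and copies a and b into it

print(c.shape)                  # (5, 4)
print(np.shares_memory(c, a))   # False: the inputs were copied, not viewed
```

Inside a loop this means the already-stitched data gets recopied on every iteration, which is why the accumulating-vstack pattern grows slower and hungrier as Data grows.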
Memory usage during plotting
Matplotlib keeps its own copies of whatever it plots: each Line2D stores x and y as float64 arrays, and plt.plot(Data) on a 2D array creates one line per column, each with an implicit x of the same length. You can reduce the footprint in Python, but unlike MATLAB's built-in plotter, Matplotlib isn't specially tuned for lines with tens of millions of points.
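You mentioned downsampling isn't the point, but for the display step specifically it costs essentially nothing: a strided slice is a view into the array, not a copy. A sketch with synthetic data standing in for the stitched Data (the Agg backend is used only so it runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs anywhere
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
Data = rng.standard_normal((100_000, 4))  # stand-in for the 58976577x4 array

step = 100                        # keep every 100th sample for display
view = Data[::step]               # strided view: no copy of Data is made
x = np.arange(0, len(Data), step)

lines = plt.plot(x, view)         # one Line2D per column, 1000 points each
```

Matplotlib still copies what it draws, but now it copies only the decimated view rather than the full array.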
Looking around, I found an optimisation for Python (as shown below):
# Collect the chunks in a list, then concatenate once at the end,
# instead of recopying the accumulated data on every iteration.
data_list = []
for i in range(len(lgth)):
    filename1 = f"{date}.00{lgth[i]}.hdf5"
    with h5py.File(filename1, 'r') as f:
        intD = np.array(f['/data']).T
    data_list.append(intD)
Data = np.vstack(data_list)
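If the chunk sizes are known up front (h5py exposes each dataset's .shape without reading it), you can go one step further and avoid even the final concatenation copy by preallocating the full array and filling it slice by slice. A sketch with a hypothetical read_chunk standing in for the h5py reads:

```python
import numpy as np

def read_chunk(i):
    """Hypothetical stand-in for np.array(f['/data']).T from file i."""
    return np.full((1000, 4), float(i))

n_chunks = 2
rows_per_chunk = 1000  # in the real script, take this from f['/data'].shape

# Allocate the destination once, then write each chunk into its slot --
# no intermediate list of arrays, no final vstack copy.
Data = np.empty((n_chunks * rows_per_chunk, 4))
for i in range(n_chunks):
    Data[i * rows_per_chunk:(i + 1) * rows_per_chunk] = read_chunk(i)
```

Peak memory is then just the final array plus one chunk, instead of roughly twice the final array during the concatenation.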
You can also use memory-profiling tools like memory_profiler
or tracemalloc
to understand where your memory is being consumed the most. They will pinpoint the specific lines/functions contributing to the high memory usage.
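tracemalloc is in the standard library, so it's the quickest to try. A minimal sketch that measures the peak allocation of a wasteful accumulating loop (recent NumPy versions report their allocations to tracemalloc):

```python
import tracemalloc

import numpy as np

tracemalloc.start()

data = np.zeros((0, 4))
for i in range(5):
    chunk = np.ones((1000, 4))
    data = np.vstack((data, chunk))  # fresh copy of everything so far, each pass

current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print(f"current: {current / 1e6:.2f} MB, peak: {peak / 1e6:.2f} MB")
```

For line-by-line attribution, memory_profiler's @profile decorator produces a per-line report when the script is run under python -m memory_profiler.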
However, Python is a general-purpose, object-oriented language: easy to read and understand, which makes development faster. For real-time work on data this size, stock Python isn't the best choice; it works well as long as you're prepared to apply optimizations like the ones above yourself.