python numpy downsampling data-formats bigdata

Fast downsampling of a huge matrix using Python (numpy memmap, PyTables or other?)


As part of my data processing I produce huge dense matrices on the order of 100000*100000 cells, which I want to downsample by a factor of 10 to reduce the amount of data. In this case I want to average over blocks of 10*10 pixels, to reduce the size of my matrix from 100000*100000 to 10000*10000.
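
For concreteness, here is a tiny sanity check of that block-mean operation on a made-up 4*4 array with a 2*2 window (the same reshape trick as in the code further down):

import numpy as np

a = np.arange(16, dtype='float32').reshape(4, 4)
window = 2
# split each axis into (blocks, window) and average over both window axes
print(a.reshape(2, window, 2, window).mean(axis=(3, 1)))
# [[ 2.5  4.5]
#  [10.5 12.5]]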

What is the fastest way to do this in Python? It does not matter to me if I need to save my original data in a new data format, because I have to downsample the same dataset multiple times.

Currently I am using numpy.memmap:

import numpy as np

data_1 = 'data_1.dat'
data_2 = 'data_2.dat'
lines = 100000
pixels = 100000
window = 10

new_lines = lines // window    # integer division so the shape stays integral
new_pixels = pixels // window
dat_1 = np.memmap(data_1, dtype='float32', mode='r', shape=(lines, pixels))
dat_2 = np.memmap(data_2, dtype='float32', mode='r', shape=(lines, pixels))

# elementwise product; note this materializes the full result array in memory
dat_in = dat_1 * dat_2
# block mean: split each axis into (blocks, window) and average both window axes
dat_out = dat_in.reshape([new_lines, window, new_pixels, window]).mean(3).mean(1)

But with large files this method becomes very slow. Likely this has something to do with the binary layout of these files, which is row-major: averaging a single 10*10 block touches ten widely separated rows on disk. Therefore, I think that a data format which stores my data in blocks instead of rows would be faster, but I am not sure how large the performance gain would be and whether there are Python packages that support this.
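
A sketch of what I have in mind, assuming h5py (PyTables offers the same chunked HDF5 storage); the file and dataset names here are made up for illustration. HDF5 stores a dataset in rectangular chunks, so a one-off conversion pass would make every later downsampling run read block-aligned data:

import h5py
import numpy as np

lines, pixels, window = 100000, 100000, 10

# one-off conversion: copy the memmap into a chunked HDF5 dataset;
# 1000*1000 chunks are window-aligned, so every 10*10 block sits in one chunk
dat = np.memmap('data_1.dat', dtype='float32', mode='r', shape=(lines, pixels))
with h5py.File('data_1.h5', 'w') as f:  # hypothetical output file
    dset = f.create_dataset('data', shape=(lines, pixels), dtype='float32',
                            chunks=(1000, 1000))
    for r in range(0, lines, 1000):  # copy strip by strip to bound memory use
        dset[r:r + 1000] = dat[r:r + 1000]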

I have also thought about downsampling the data before creating such a huge matrix (not shown here), but my input data is fractured and irregular, so that would become very complex.


Solution

  • Based on this answer, I think this might be a relatively fast method, depending on how much overhead reshape gives you with memmap.

    import numpy as np

    def downSample(a, window):
        i, j = a.shape
        # row/column indices where each window starts
        ir = np.arange(0, i, window)
        jr = np.arange(0, j, window)
        n = 1. / window**2
        # sum every window along both axes, then scale the sums into means
        return n * np.add.reduceat(np.add.reduceat(a, ir, axis=0), jr, axis=1)
    

    Hard to test speed without your dataset.
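
    To keep memory bounded when a is a large memmap, you could apply this strip by strip. A minimal sketch building on the downSample function above, assuming lines is a multiple of the strip height; downSampleStrips is a made-up helper name:

    def downSampleStrips(path, lines, pixels, window, strip=1000):
        # strip must be a multiple of window so no block straddles two strips
        a = np.memmap(path, dtype='float32', mode='r', shape=(lines, pixels))
        out = np.empty((lines // window, pixels // window), dtype='float32')
        for r in range(0, lines, strip):
            # each strip is only strip*pixels floats, read sequentially from disk
            out[r // window:(r + strip) // window] = downSample(a[r:r + strip], window)
        return out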