python-3.xnumpystructured-array

Numpy repack_fields with range or a list allocates memory


Issue:

I am trying to repack a subset of rows and fields from a large numpy structured array.

When I use a slice, I am able to use repack_fields, but when I use a range I am not. Calling range before repack_fields appears to be allocating all of the memory needed by the original array.

Example:

Below is an example where I limit the available memory in order to introduce the error I am observing my use case.

import numpy as np
from numpy.lib.recfunctions import repack_fields
import resource

resource.setrlimit(resource.RLIMIT_AS, (int(1.5e9), int(1.5e9)))

H = np.zeros(100, dtype=[('f1', int), ('f2', int), ('large', float, 1000000)])

print('Using slicing: ')
repack_fields(H[['f1', 'f2']][0:50])
print('Using range: ')
repack_fields(H[['f1', 'f2']][range(0, 50)])

produces the output:

Using slicing: 
Using range: 
Traceback (most recent call last):
  File "test.py", line 12, in <module>
    repack_fields(H[['f1', 'f2']][range(0, 50)])
MemoryError: Unable to allocate 381. MiB for an array with shape (50,) and data type {'names':['f1','f2'], 'formats':['<i8','<i8'], 'offsets':[0,8], 'itemsize':8000016}

Questions:

  1. Why is the behavior of range(0, 50) different than 0:50? (A list also doesn't work.) I know in the above example, one could repack the fields first, and then reference the rows. (That is, repack_fields(H[['f1', 'f2']])[range(0, 50)] works.) But I don't want to have to know whether it is better to get rows first or fields first.

  2. What is the correct way to take a subset of rows/fields from a large numpy structured array? (even when the rows are not consecutive)?


Solution

  • repack_fields(H[['f1', 'f2']][0:50])
    

    Both [['f1', 'f2']] and [0:50] produce a view, one because it's a multifield index, and the other because it's a slice (basic indexing). So that doesn't require new memory. repack_fields makes a new array with space for just those 2 fields, and 50 records, and copies values from the view.

    repack_fields(H[['f1', 'f2']][range(0, 50)])
    

    Again the fields index is a view, referencing the whole structure, including the large field. [range...] is advanced indexing, making a copy that includes 50 records of 'large'.

    Look at the error:

     Unable to allocate 381. MiB for an array with shape (50,) and data type
     {'names':['f1','f2'], 'formats':['<i8','<i8'], 'offsets':[0,8], 'itemsize':8000016}
    
     In [336]: 50*8000016/1e6
     Out[336]: 400.0008
    

    There's the 381 MB that it's trying to allocate. The error occurs in the [range(50)] indexing. It hasn't gotten to the repack yet.

    So you have to understand two things.