I am trying to repack a subset of rows and fields from a large numpy structured array. When I use a slice, I am able to use repack_fields, but when I use a range I am not. Calling range before repack_fields appears to allocate all of the memory needed by the original array. Below is an example where I limit the available memory in order to reproduce the error I am observing in my use case:
import numpy as np
from numpy.lib.recfunctions import repack_fields
import resource
resource.setrlimit(resource.RLIMIT_AS, (int(1.5e9), int(1.5e9)))  # cap address space at ~1.5 GB
H = np.zeros(100, dtype=[('f1', int), ('f2', int), ('large', float, 1000000)])
print('Using slicing: ')
repack_fields(H[['f1', 'f2']][0:50])
print('Using range: ')
repack_fields(H[['f1', 'f2']][range(0, 50)])
produces the output:
Using slicing:
Using range:
Traceback (most recent call last):
File "test.py", line 12, in <module>
repack_fields(H[['f1', 'f2']][range(0, 50)])
MemoryError: Unable to allocate 381. MiB for an array with shape (50,) and data type {'names':['f1','f2'], 'formats':['<i8','<i8'], 'offsets':[0,8], 'itemsize':8000016}
Why is the behavior of range(0, 50) different from 0:50? (A list also doesn't work.) I know that in the above example one could repack the fields first and then reference the rows; that is, repack_fields(H[['f1', 'f2']])[range(0, 50)] works. But I don't want to have to know whether it is better to get rows first or fields first.
What is the correct way to take a subset of rows/fields from a large numpy structured array, even when the rows are not consecutive?
repack_fields(H[['f1', 'f2']][0:50])
Both [['f1', 'f2']] and [0:50] produce a view, one because it's a multifield index, and the other because it's a slice (basic indexing). So that doesn't require new memory. repack_fields makes a new array with space for just those 2 fields and 50 records, and copies values from the view.
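You can check the view/copy distinction with np.shares_memory. Here is a minimal sketch using a hypothetical scaled-down stand-in for H (called Hs, with the large field shrunk to 10 elements so it fits comfortably in memory):

import numpy as np
from numpy.lib.recfunctions import repack_fields

# Hypothetical scaled-down stand-in for H: 'large' shrunk from 1000000 to 10 elements
Hs = np.zeros(100, dtype=[('f1', int), ('f2', int), ('large', float, 10)])

view = Hs[['f1', 'f2']][0:50]
print(np.shares_memory(view, Hs))    # True: multifield index and slice both give views
print(view.dtype.itemsize)           # 96: the view's dtype keeps the full record size

packed = repack_fields(view)
print(np.shares_memory(packed, Hs))  # False: repack_fields copied into a compact array
print(packed.dtype.itemsize)         # 16: just the two int64 fields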
repack_fields(H[['f1', 'f2']][range(0, 50)])
Again the fields index is a view referencing the whole structure, including the large field. [range(...)] is advanced indexing, making a copy that includes 50 records of large.
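Continuing the sketch above, the advanced index on the multifield view copies full-width records:

adv = Hs[['f1', 'f2']][range(0, 50)]
print(np.shares_memory(adv, Hs))     # False: advanced indexing made a copy
print(adv.dtype.itemsize)            # still 96: the copy keeps the original itemsize

With the real 1000000-element large field, that itemsize is 8000016, which is exactly what shows up in the failed allocation.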
Look at the error:
Unable to allocate 381. MiB for an array with shape (50,) and data type
{'names':['f1','f2'], 'formats':['<i8','<i8'], 'offsets':[0,8], 'itemsize':8000016}
In [336]: 50*8000016/1e6
Out[336]: 400.0008
There's the allocation it's attempting: 400 MB, which is the 381 MiB in the message. The error occurs in the [range(0, 50)] indexing; it hasn't gotten to the repack yet.
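You can confirm the arithmetic without triggering the allocation, since inspecting a view's dtype is free (this uses the original H from the question):

sub = H[['f1', 'f2']]                    # a view: no allocation happens here
print(sub.dtype.itemsize)                # 8000016: offsets [0, 8] within the full record
print(50 * sub.dtype.itemsize / 2**20)   # ~381.47: the MiB figure in the traceback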
So you have to understand two things:

1. Indexing with a slice makes a view that does not consume added memory. Indexing with a range, list, or array is 'advanced indexing' and makes a copy.

2. Multifield indexing makes a view. Even if the fields are a subset of the source, the itemsize is still the original size. The purpose of repack_fields is to make a new array with a new dtype and new itemsize, containing just the values for the selected fields.
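Putting the two together: for non-consecutive rows, repack the fields first (as the question already noted), so the advanced index copies compact records instead of full-width ones. A minimal sketch with the original H:

compact = repack_fields(H[['f1', 'f2']])  # copies 100 records x 16 bytes: tiny
rows = compact[range(0, 50)]              # advanced indexing now copies only small records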