For the scipy functions csr_matrix and csc_matrix, specifically the forms that take row and column indices:
csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
csc_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
How much intermediate memory do they use? Presumably they have to use some in order to convert to the CSR/CSC representations.
The context is that calling these functions uses a lot of memory in a particular case, to the point that it fails by running out of memory, so I am trying to reason about it. In total, data + row_ind + col_ind take 25 GB in my particular example, and I have about 35 GB of memory remaining, but this doesn't seem to be enough to call either csr_matrix or csc_matrix.
The docs at https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html and https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html don't seem to give info on this.
Here is what I hope is a roughly equivalent bit of code that runs out of memory on my 60GB (Linux) system.
import numpy as np
from scipy.sparse import csc_matrix
num_values = 2500000000
output_matrix_size = 150000
matrix = csc_matrix(
    (
        np.zeros(num_values, dtype=np.float16),
        (
            np.zeros(num_values, dtype=np.int32),
            np.zeros(num_values, dtype=np.int32),
        ),
    ),
    shape=(output_matrix_size, output_matrix_size),
)
I've traced the operation through the sparse module. The construction works as follows:
The following allocations are made:

1. A coo_matrix is constructed from the data, row, and column arrays (with matching dtypes, as here, this should not copy them).
2. The coo_matrix is converted to CSC format.
3. For this conversion, a new row index array and a new value array are allocated; the indptr array is also allocated.
4. A vector<pair<Index, Value>> is allocated temporarily to sort row indices. Its maximum size is determined by the maximum number of nonzero entries per column. In this case that is the full data size, since all nonzero entries are in one column.

With your 2.5e9 initial values, point 3 takes 14 GiB (4 bytes per row index, 2 bytes per value) plus <2 MiB for the indptr array. Point 4 takes an additional 18.6 GiB (a pair<int32, float16> takes 8 bytes).
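To double-check those figures, a quick back-of-the-envelope calculation using the byte sizes above:

num_values = 2_500_000_000
num_cols = 150_000
csc_arrays = num_values * (4 + 2)  # new int32 row index + float16 value arrays
indptr = (num_cols + 1) * 8        # column pointer array; 8 bytes per entry as an upper bound
sort_temp = num_values * 8         # pair<int32, float16>, padded to 8 bytes
print(csc_arrays / 2**30)          # ~13.97 GiB
print(indptr / 2**20)              # ~1.14 MiB
print(sort_temp / 2**30)           # ~18.63 GiB

That is roughly 32.6 GiB of intermediate allocations on top of the 25 GB of input arrays, which is right at the edge of the ~35 GB you have left.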
As a workaround, I suggest batching the conversion. Something like this:
batchsize = 524288 # 2 MiB divided by 4 byte. Just a guess
values = np.zeros(num_values, dtype=np.float16)
rows = cols = np.zeros(num_values, dtype=np.int32)
shape = (output_matrix_size, output_matrix_size)
out = csc_matrix(shape, dtype=values.dtype)
for i in range(0, len(values), batchsize):
    out += csc_matrix((values[i:i+batchsize],
                       (rows[i:i+batchsize], cols[i:i+batchsize])),
                      shape=shape)
An extended version of this could be designed as a parallelized reduction; a rough sketch follows below. Of course, this only makes sense if the data set contains a very large number of duplicates.
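Here is a minimal sketch of such a parallelized reduction. It assumes the Linux fork start method (so workers inherit values, rows, cols, batchsize and shape without pickling them); the build_batch helper and the worker count are just placeholders, and only the per-batch conversion runs in parallel while the partial results are still summed serially in the parent:

from functools import reduce
from multiprocessing import Pool

def build_batch(i):
    # Convert one slice; duplicates within the slice are summed by csc_matrix itself
    return csc_matrix((values[i:i+batchsize],
                       (rows[i:i+batchsize], cols[i:i+batchsize])),
                      shape=shape)

with Pool(processes=4) as pool:  # worker count is a guess, tune for your machine
    out = reduce(lambda acc, batch: acc + batch,
                 pool.imap_unordered(build_batch,
                                     range(0, len(values), batchsize)),
                 csc_matrix(shape, dtype=values.dtype))

Each worker pickles its batch back to the parent, so peak memory is roughly one batch per worker plus the growing output matrix.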
Performance of this style of batching will be very poor if many nonzeros are not duplicates, because the output matrix keeps growing while the number of elements added per step stays constant. In that case it may be better to only ever sum matrices of similar size, in a merge-sort-like pattern; in terms of space and time requirements this falls somewhere between the direct conversion and the simple batching above. Something like this:
batches = (csc_matrix((values[i:i+batchsize],
                       (rows[i:i+batchsize], cols[i:i+batchsize])),
                      shape=shape)
           for i in range(0, len(values), batchsize))
stack = [csc_matrix(shape, dtype=values.dtype)]
for batch in batches:
    # Fold stack entries into the new batch while the top of the stack
    # is at most twice its size, so additions combine matrices of similar nnz
    while stack and stack[-1].nnz <= batch.nnz * 2:
        batch += stack.pop()
    stack.append(batch)
# Collapse whatever remains on the stack into a single matrix
while len(stack) > 1:
    batch = stack.pop()
    stack[-1] += batch
out = stack.pop()