Tags: python, scipy, sparse-matrix

SciPy sparse matrix csr_matrix and csc_matrix functions - how much intermediate memory do they use?


For the SciPy functions csr_matrix and csc_matrix, specifically the forms that take row and column indices:

csr_matrix((data, (row_ind, col_ind)), [shape=(M, N)])
csc_matrix((data, (row_ind, col_ind)), [shape=(M, N)])

How much intermediate memory do they use? Presumably they have to use some in order to convert to the CSR/CSC representations.

The context is that calling these functions uses a lot of memory in a particular case, to the point that it fails by running out of memory, so I am trying to reason about it. In total, data + row_ind + col_ind take 25GB in my particular example, and I have about 35GB of memory remaining, but this doesn't seem to be enough to call either csr_matrix or csc_matrix.

The docs at https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html and https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html don't seem to give info on this.

Here is what I hope is a roughly equivalent bit of code that runs out of memory on my 60GB (Linux) system.

import numpy as np
from scipy.sparse import csc_matrix

num_values = 2500000000  # 2.5e9 entries: 5 GB of float16 data + 20 GB of int32 indices
output_matrix_size = 150000

matrix = csc_matrix(
    (
        np.zeros(num_values, dtype=np.float16),
        (
            np.zeros(num_values, dtype=np.int32),
            np.zeros(num_values, dtype=np.int32),
        ),
    ),
    shape=(output_matrix_size, output_matrix_size),
)
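As a sanity check on the quoted 25GB input size, the byte counts for the three arrays work out exactly (this is plain arithmetic, nothing is allocated here):

```python
num_values = 2_500_000_000

data_bytes = num_values * 2         # float16: 2 bytes per value
index_bytes = num_values * 4        # int32: 4 bytes per index
total_gb = (data_bytes + 2 * index_bytes) / 10**9  # data + row_ind + col_ind

print(total_gb)  # 25.0
```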

Solution

  • I've traced the operation through the sparse module. The construction works as follows:

    1. A COO matrix with duplicates is created
    2. The COO matrix is converted to CSC with duplicates
    3. The duplicates are summed up

    The following allocations are made:

    1. Optionally, not in this case: A converted copy of the indices if the index type is suboptimal for a COO matrix. The decision is made between int32 and int64. int32 is used here and is sufficient, so no copy is made
    2. Optionally, not in this case: A converted copy of the values is made if the dtype does not match the specified type
    3. A copy of the row indices and the data is made when converting to CSC format (with duplicates). The remaining metadata for the CSC format (mostly the indptr array) is also allocated
    4. A C++ vector<pair<Index, Value>> is allocated temporarily to sort row indices. Its maximum size is determined by the maximum number of nonzero entries per column. In this case this is the full data size since all nonzero entries are in one column
    5. A copy of the pruned, unique row indices and values is retained

    With your 2.5e9 initial values, point 3 takes 14 GiB (4 bytes per row index, 2 bytes per value) plus <2 MiB for the indptr array. Point 4 takes an additional 18.6 GiB (a pair<int32, float16> occupies 8 bytes due to alignment padding).
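The duplicate summation described in steps 1–3 can be observed on a toy example. This is a small sketch of the same COO-to-CSC path, not the exact internal code:

```python
import numpy as np
from scipy.sparse import coo_matrix

# three entries, two of which hit the same (row, col) position
data = np.array([1.0, 2.0, 3.0])
rows = np.array([0, 0, 1], dtype=np.int32)
cols = np.array([0, 0, 1], dtype=np.int32)

coo = coo_matrix((data, (rows, cols)), shape=(2, 2))
print(coo.nnz)    # 3: duplicates are still stored separately in COO

csc = coo.tocsc() # conversion sums the duplicates
print(csc.nnz)    # 2
print(csc[0, 0])  # 3.0 = 1.0 + 2.0
```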

    As a workaround, I suggest batching the conversion. Something like this:

    batchsize = 524288 # 2 MiB divided by 4 bytes. Just a guess
    values = np.zeros(num_values, dtype=np.float16)
    rows = cols = np.zeros(num_values, dtype=np.int32)
    shape = (output_matrix_size, output_matrix_size)
    out = csc_matrix(shape, dtype=values.dtype)
    for i in range(0, len(values), batchsize):
        out += csc_matrix((values[i:i+batchsize],
                           (rows[i:i+batchsize], cols[i:i+batchsize])),
                          shape=shape)
    

    An extended version of this may be designed as a parallelized reduction. Of course this only makes sense if the data set has a very large number of duplicates.

    Performance of this style of batching will be very poor if there are many nonzeros that are not duplicates, because the output matrix keeps growing while each added batch stays the same size. Alternatively, it may be better to sum only matrices of similar size, in a merge-sort-like pattern. In terms of space and time requirements, this falls somewhere between the two other approaches. Something like this:

    batches = (csc_matrix((values[i:i+batchsize],
                           (rows[i:i+batchsize], cols[i:i+batchsize])),
                          shape=shape)
               for i in range(0, len(values), batchsize))
    
    stack = [csc_matrix(shape, dtype=values.dtype)]
    for batch in batches:
        while stack and stack[-1].nnz <= batch.nnz * 2:
            batch += stack.pop()
        stack.append(batch)
    
    while len(stack) > 1:
        batch = stack.pop()
        stack[-1] += batch
    
    out = stack.pop()
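As a sanity check at small scale, the merge-style batching above should produce the same matrix as a direct construction. A runnable sketch with made-up, scaled-down sizes (the values of `num_values`, `batchsize` and `shape` here are arbitrary stand-ins):

```python
import numpy as np
from scipy.sparse import csc_matrix

rng = np.random.default_rng(0)
num_values = 10_000
batchsize = 1_000
shape = (50, 50)  # small shape forces many duplicate (row, col) pairs

values = rng.random(num_values)
rows = rng.integers(0, shape[0], num_values).astype(np.int32)
cols = rng.integers(0, shape[1], num_values).astype(np.int32)

# direct construction: duplicates are summed internally
direct = csc_matrix((values, (rows, cols)), shape=shape)

# merge-sort style batching from above
batches = (csc_matrix((values[i:i+batchsize],
                       (rows[i:i+batchsize], cols[i:i+batchsize])),
                      shape=shape)
           for i in range(0, len(values), batchsize))

stack = [csc_matrix(shape, dtype=values.dtype)]
for batch in batches:
    while stack and stack[-1].nnz <= batch.nnz * 2:
        batch += stack.pop()
    stack.append(batch)

while len(stack) > 1:
    batch = stack.pop()
    stack[-1] += batch

out = stack.pop()
print(np.allclose(direct.toarray(), out.toarray()))  # True
```

The only difference between the two results is floating-point rounding from the different summation order, which `np.allclose` tolerates.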