pythonscipysparse-matrix

Bug in large sparse CSR binary matrices multiplication result


This boggles my mind, is this a known bug or am I missing something? If a bug is there a way to circumvent it?


Suppose I have a relatively small binary (0/1) n x q scipy.sparse.csr_matrix, as in:

import numpy as np
from scipy import sparse

def get_dummies(vec, vec_max):
    vec_size = vec.size
    Z = sparse.csr_matrix((np.ones(vec_size), (np.arange(vec_size), vec)), shape=(vec_size, vec_max), dtype=np.uint8)
    return Z

q = 100
ns = np.round(np.random.random(q)*100).astype(np.int16)
Z_idx = np.repeat(np.arange(q), ns)
Z = get_dummies(Z_idx, q)
Z
<5171x100 sparse matrix of type '<class 'numpy.uint8'>'
    with 5171 stored elements in Compressed Sparse Row format>

Here Z is a standard dummy variables matrix, with n=5171 observations and q=100 variables:

Z[:5, :5].toarray()
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0]], dtype=uint8)

E.g., if the first 5 variables have...

ns[:5]
array([21, 22, 37, 24, 99], dtype=int16)

frequencies, we would also see this in Z's column sums:

Z[:, :5].sum(axis=0)
matrix([[21, 22, 37, 24, 99]], dtype=uint64)

Now, as expected, if I multiply Z.T @ Z I should get a q x q diagonal matrix, on the diagonal the frequencies of the q variables:

print((Z.T @ Z).shape)
print((Z.T @ Z)[:5, :5].toarray()
(100, 100)
[[21  0  0  0  0]
 [ 0 22  0  0  0]
 [ 0  0 37  0  0]
 [ 0  0  0 24  0]
 [ 0  0  0  0 99]]

Now for the bug: If n is really large (for me it happens around n = 100K already):

q = 1000
ns = np.round(np.random.random(q)*1000).astype(np.int16)
Z_idx = np.repeat(np.arange(q), ns)
Z = get_dummies(Z_idx, q)
Z
<495509x1000 sparse matrix of type '<class 'numpy.uint8'>'
    with 495509 stored elements in Compressed Sparse Row format>

The frequencies are large, the sum of Z's columns is as expected:

print(ns[:5])
Z[:, :5].sum(axis=0)
array([485, 756, 380,  87, 454], dtype=int16)
matrix([[485, 756, 380,  87, 454]], dtype=uint64)

But the Z.T @ Z is messed up! In the sense that I'm not getting the right frequencies on the diagonal:

print((Z.T @ Z).shape)
print((Z.T @ Z)[:5, :5].toarray())
(1000, 1000)
[[229   0   0   0   0]
 [  0 244   0   0   0]
 [  0   0 124   0   0]
 [  0   0   0  87   0]
 [  0   0   0   0 198]]

Amazingly, there is some relation to the true frequencies:

import matplotlib.pyplot as plt

plt.scatter(ns, (Z.T @ Z).diagonal())
plt.xlabel('real frequencies')
plt.ylabel('values on ZZ diagonal')
plt.show()

enter image description here

What is going on?

I'm using standard colab:

import scipy as sc

print(np.__version__)
print(sc.__version__)
1.25.2
1.11.4

PS: Obviously if I wanted just Z.T @ Z's output matrix there are easier ways to get it, this is a very simplified reduced problem, thanks.


Solution

  • You are using uint8 in get_dummies

    print(Z.dtype, (Z.T @ Z)[:5, :5].dtype)
    # uint8,  uint8
    

    So the result is overflowing. The pattern is because the result is the true result modulo 256. If you change the dtype to uint16, the problem disappears.

    The fact that sum increases the precision of the dtype is a bit surprising, but it is documented and consistent with what NumPy sum does.