This boggles my mind: is this a known bug, or am I missing something? If it is a bug, is there a way to work around it?
Suppose I have a relatively small binary (0/1) n x q scipy.sparse.csr_matrix, as in:
import numpy as np
from scipy import sparse
def get_dummies(vec, vec_max):
    vec_size = vec.size
    Z = sparse.csr_matrix((np.ones(vec_size), (np.arange(vec_size), vec)),
                          shape=(vec_size, vec_max), dtype=np.uint8)
    return Z
q = 100
ns = np.round(np.random.random(q)*100).astype(np.int16)
Z_idx = np.repeat(np.arange(q), ns)
Z = get_dummies(Z_idx, q)
Z
<5171x100 sparse matrix of type '<class 'numpy.uint8'>'
with 5171 stored elements in Compressed Sparse Row format>
Here Z is a standard dummy-variables matrix, with n=5171 observations and q=100 variables:
Z[:5, :5].toarray()
array([[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0]], dtype=uint8)
E.g., if the first 5 variables have...
ns[:5]
array([21, 22, 37, 24, 99], dtype=int16)
frequencies, we would also see this in Z's column sums:
Z[:, :5].sum(axis=0)
matrix([[21, 22, 37, 24, 99]], dtype=uint64)
Now, as expected, if I compute Z.T @ Z I should get a q x q diagonal matrix, with the frequencies of the q variables on the diagonal:
print((Z.T @ Z).shape)
print((Z.T @ Z)[:5, :5].toarray())
(100, 100)
[[21 0 0 0 0]
[ 0 22 0 0 0]
[ 0 0 37 0 0]
[ 0 0 0 24 0]
[ 0 0 0 0 99]]
Now for the bug: If n is really large (for me it happens around n = 100K already):
q = 1000
ns = np.round(np.random.random(q)*1000).astype(np.int16)
Z_idx = np.repeat(np.arange(q), ns)
Z = get_dummies(Z_idx, q)
Z
<495509x1000 sparse matrix of type '<class 'numpy.uint8'>'
with 495509 stored elements in Compressed Sparse Row format>
The frequencies are large, and the sum of Z's columns is as expected:
print(ns[:5])
Z[:, :5].sum(axis=0)
array([485, 756, 380, 87, 454], dtype=int16)
matrix([[485, 756, 380, 87, 454]], dtype=uint64)
But Z.T @ Z is messed up, in the sense that I'm not getting the right frequencies on the diagonal:
print((Z.T @ Z).shape)
print((Z.T @ Z)[:5, :5].toarray())
(1000, 1000)
[[229 0 0 0 0]
[ 0 244 0 0 0]
[ 0 0 124 0 0]
[ 0 0 0 87 0]
[ 0 0 0 0 198]]
Amazingly, there is some relation to the true frequencies:
import matplotlib.pyplot as plt
plt.scatter(ns, (Z.T @ Z).diagonal())
plt.xlabel('real frequencies')
plt.ylabel('values on Z.T @ Z diagonal')
plt.show()
What is going on?
I'm using standard colab:
import scipy as sc
print(np.__version__)
print(sc.__version__)
1.25.2
1.11.4
PS: Obviously, if I just wanted the Z.T @ Z output matrix there are easier ways to get it; this is a very simplified, reduced version of the problem. Thanks.
You are using uint8 in get_dummies:
print(Z.dtype, (Z.T @ Z)[:5, :5].dtype)
# uint8, uint8
So the result is overflowing: the pattern you see is the true result modulo 256 (the maximum value of uint8 is 255). If you change the dtype to uint16 or wider, the problem disappears.
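For example, here is a minimal sketch of the fix; the dtype parameter on get_dummies is my own variation on your function, not part of your original code:

```python
import numpy as np
from scipy import sparse

def get_dummies(vec, vec_max, dtype=np.int64):
    # Same construction as before, but the dtype of the stored data now
    # also determines the dtype (and overflow behaviour) of Z.T @ Z.
    vec_size = vec.size
    Z = sparse.csr_matrix((np.ones(vec_size), (np.arange(vec_size), vec)),
                          shape=(vec_size, vec_max), dtype=dtype)
    return Z

q = 1000
ns = np.round(np.random.random(q) * 1000).astype(np.int64)
Z_idx = np.repeat(np.arange(q), ns)
Z = get_dummies(Z_idx, q)

# With a wide enough dtype, the diagonal matches the true frequencies
print(np.array_equal((Z.T @ Z).diagonal(), ns))
```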
The fact that sum increases the precision of the dtype while matrix multiplication does not is a bit surprising, but it is documented and consistent with what NumPy's sum does.
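The two behaviours can be seen side by side in a small sketch: a single column with more than 255 ones, so the true count does not fit in uint8:

```python
import numpy as np
from scipy import sparse

# 300 ones in one column: the true count (300) exceeds the uint8 max (255)
Z = sparse.csr_matrix(np.ones((300, 1)), dtype=np.uint8)

# sum() upcasts the accumulator dtype, so the column total is correct
print(Z.sum(axis=0))        # [[300]]

# @ keeps the uint8 dtype, so the count wraps modulo 256: 300 - 256 = 44
print((Z.T @ Z).toarray())  # [[44]]
```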