I am attempting to invert (mod 2) a dense (non-sparse) binary matrix. I want to do this for very large matrices, like 100_000 by 100_000. I have tried libraries such as SymPy, NumPy, and Numba, but most of them work with floats and don't reduce mod 2 at each step. For example, NumPy reports the determinant of a random invertible 1000 by 1000 binary matrix as inf.
Here is my attempted code, but it already lags at 10_000 by 10_000.
This function is also nice because, even if the matrix is not invertible, I can still read off the rank.
import numpy as np

def binary_inv(A):
    # Gauss-Jordan elimination over GF(2) on the augmented matrix [A | I];
    # the right half of the result is A^-1 when A is invertible.
    n = len(A)
    A = np.hstack((A.astype(np.bool_), np.eye(n, dtype=np.bool_)))
    for i in range(n):
        # Find the first row at or below i with a 1 in column i and swap it up.
        j = np.argmax(A[i:, i]) + i
        if i != j:
            A[[i, j], i:] = A[[j, i], i:]
        # Clear the pivot bit so the pivot row is excluded from the update mask.
        A[i, i] = False
        # XOR the pivot row into every row that still has a 1 in column i.
        A[A[:, i], i+1:] ^= A[i:i+1, i+1:]
    return A[:, n:]
Any thoughts on making this as fast as possible?
First of all, the algorithmic complexity of matrix inversion is the same as that of matrix multiplication. A naive matrix multiplication has a complexity of O(n**3). The best practical algorithm known so far is Strassen, with a complexity of ~O(n**2.8) (the others can be considered galactic algorithms). This complexity improvement comes with a larger hidden constant, so it is only worth it for really large matrices, like ones bigger than 10000x10000. On such big matrices, Strassen can certainly result in a speed-up of about 2x to 4x (because Strassen on small matrices is not SIMD-friendly). This is far from great considering that code implementing Strassen efficiently is generally noticeably more complex than code implementing the naive algorithm efficiently (which is actually still used by most BLAS libraries nowadays). In fact, the biggest improvements come from basic (low-level) code optimizations, not algorithmic ones.
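To make that equivalence concrete, here is a minimal sketch of recursive block inversion over GF(2) via the Schur complement: one n x n inversion becomes two half-size inversions plus a handful of half-size products, so a fast GF(2) multiplication kernel (packed, threaded, or Strassen-based) does all the heavy lifting. The names gf2_matmul and gf2_block_inv are illustrative, the naive multiply is only a placeholder, the base size 256 is arbitrary, and it assumes every leading block it recurses on is invertible (a real implementation would pivot to handle singular leading blocks).

import numpy as np

def gf2_matmul(X, Y):
    # Naive stand-in for a fast GF(2) matrix product; this is where a packed,
    # threaded or Strassen-based kernel would plug in.
    return ((X.astype(np.int64) @ Y.astype(np.int64)) % 2).astype(bool)

def gf2_block_inv(M, base=256):
    # Recursive 2x2 block inversion via the Schur complement over GF(2),
    # where subtraction is XOR. Assumes the leading block is invertible at
    # every level.
    n = len(M)
    if n <= base:
        return binary_inv(M)            # the question's Gauss-Jordan as base case
    h = n // 2
    A, B = M[:h, :h], M[:h, h:]
    C, D = M[h:, :h], M[h:, h:]
    Ai  = gf2_block_inv(A, base)        # A^-1
    CAi = gf2_matmul(C, Ai)
    Si  = gf2_block_inv(D ^ gf2_matmul(CAi, B), base)   # (D + C A^-1 B)^-1
    AiB = gf2_matmul(Ai, B)
    SiCAi = gf2_matmul(Si, CAi)
    return np.block([
        [Ai ^ gf2_matmul(AiB, SiCAi), gf2_matmul(AiB, Si)],
        [SiCAi,                       Si],
    ])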
One major optimization is to pack the boolean values into bits. NumPy does not do this automatically (it uses 1 byte per boolean value). Packing lets the CPU operate on 8 times more items per instruction with very little overhead. Besides, it also reduces the memory footprint of the algorithm by the same factor, so the operation becomes more cache-friendly.
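Here is a minimal NumPy sketch of the idea (the size and the 64-bit reinterpretation are just for illustration): each boolean row is packed with np.packbits and viewed as 64-bit words, so adding one row to another over GF(2) becomes a short run of word-wide XORs.

import numpy as np

rng = np.random.default_rng(0)
n = 4096                                  # illustrative size, a multiple of 64
A = rng.integers(0, 2, (n, n), dtype=np.uint8).astype(bool)

packed = np.packbits(A, axis=1)           # shape (n, n // 8), 8 entries per byte
words  = packed.view(np.uint64)           # same bytes seen as 64-bit words

# Adding row 1 to row 0 over GF(2) is now just n // 64 word-wide XORs:
words[0] ^= words[1]

# Sanity check against the unpacked update.
A[0] ^= A[1]
assert np.array_equal(np.unpackbits(packed, axis=1).astype(bool), A)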
Another major optimization is to perform the computation with multiple threads. Modern mainstream personal CPUs often have 6~8 cores (significantly more on computing servers), so you should expect the computation to be about 3~6 times faster. This is not as good as the number of cores because Gauss-Jordan elimination does not scale well: it tends to be memory-bound (which is why packing bits is so important).
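As a rough sketch of what the threaded inner step can look like (assuming rows already packed into 64-bit words as above; eliminate_rows, mask and pivot_row are illustrative names), Numba's prange distributes the row updates across threads:

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def eliminate_rows(words, mask, pivot_row):
    # XOR the (packed) pivot row into every row flagged in `mask`,
    # splitting the rows across threads.
    n_rows, n_words = words.shape
    for r in nb.prange(n_rows):
        if mask[r]:
            for w in range(n_words):
                words[r, w] ^= pivot_row[w]

# Typical call for pivot i: `mask` is column i of the matrix with the pivot
# row itself cleared, and the pivot row is passed as a copy:
# eliminate_rows(words, mask, words[i].copy())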
Yet another optimization consists in computing the matrix block by block so as to make the operation more cache-friendly. This kind of optimization resulted in roughly twice faster code the last time I tested it (with relatively small blocks). In practice, it might pay off even more here since you operate on the GF(2) field.
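As a bare-bones illustration of the tiling structure (names and block size are illustrative; the real payoff comes from also grouping several pivots per sweep so the trailing matrix is traversed less often, which is more involved than shown here):

import numpy as np

def blocked_update(words, rows, pivot_row, block_words=512):
    # Apply one pivot's GF(2) row update column-block by column-block so that
    # the pivot slice and the touched rows stay resident in cache.
    n_words = words.shape[1]
    for start in range(0, n_words, block_words):
        stop = min(start + block_words, n_words)
        words[rows, start:stop] ^= pivot_row[start:stop]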
Applying such optimizations to NumPy code without introducing substantial overhead is difficult, if even possible. I strongly advise you to write this in a native language like C, C++, or Rust. Numba can help, but there are known performance issues (mainly with vectorization) around bit packing/unpacking.
You should be careful to write SIMD-friendly code, because SIMD instructions can operate on far more items than scalar ones (e.g. the AVX2 instruction set operates on 32 x 8-bit items = 256 bits at a time, at a cost similar to scalar instructions).
If all of this is not enough, you can move the computation to a GPU and get another significant improvement (say 4~20x depending on the target GPU). Note that you should pack the boolean values into 32-bit integers to get good performance. On top of that, tensor cores might help push this limit further if you manage to leverage them. Be aware that applying some of the above optimizations on a GPU is actually pretty hard (especially blocking), even for skilled developers. Once optimized, I expect a very good GPU implementation to take less than 10 minutes on a recent mid/high-end GPU.
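A rough starting point, assuming CuPy is available (names and sizes are illustrative): pack on the host, reinterpret the bytes as 32-bit words, and run the XOR row updates on the device.

import numpy as np
import cupy as cp   # assumed available; any GPU array library with bitwise ops works

rng = np.random.default_rng(0)
n = 8192                                             # illustrative size, multiple of 32
A = rng.integers(0, 2, (n, n), dtype=np.uint8).astype(bool)

packed32 = np.packbits(A, axis=1).view(np.uint32)    # shape (n, n // 32) on the host
d_words  = cp.asarray(packed32)                      # copy to the GPU once

# One elimination step for pivot i: XOR the pivot row into every flagged row
# (here `mask` is just a random stand-in for "rows with a 1 in column i").
i = 0
mask = cp.asarray(rng.integers(0, 2, n, dtype=np.uint8).astype(bool))
mask[i] = False                                      # never update the pivot row itself
d_words[mask] ^= d_words[i]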
All of this should be enough to make the operation realistically 2 to 3 orders of magnitude faster!