The question has changed since its initial posting as I've chased down a few leads. At this point, I'd say that I'm really looking for the following answers:
Can a significant amount of time be saved by replacing addition/multiplication followed by a modulo-2 reduction with AND/XOR operations (assuming the total number of such operations is kept the same)? If not, then why not? ANSWER: some time can indeed be saved, but it's arguable whether that amount is "significant".
Where can I read more about the specific approach taken by the BLAS matrix multiplication underlying numpy? Ideally, I'd like a source that doesn't require deciphering the FORTRAN code forged by the sages of the past. ANSWER: The original paper proposing the BLAS matrix multiplication algorithms used today can be found here.
I've left my question in its original form below for posterity.
The following are two algorithms for multiplying binary matrices (i.e. taking the "dot" product) modulo 2. The first ("default") approach just uses numpy matrix-multiplication, then reduces modulo 2. The second ("alternative") approach attempts to speed things up by replacing the addition operation with an xor operation.
import timeit
import numpy as np
import matplotlib.pyplot as plt

def mat_mult_1(A, B):
    # "default": numpy/BLAS matrix product, then reduce mod 2
    return A @ B % 2

def mat_mult_2(A, B):
    # "alternative": AND in place of multiplication, XOR-reduce in place of summation
    return np.logical_xor.reduce(A[:, :, None] & B[None, :, :], axis=1)
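Since, for single bits a and b, a·b = a AND b and (a + b) mod 2 = a XOR b, the two functions should agree. A quick sanity check (separate from the timing runs):

A = np.random.randint(2, size=(50, 50))
B = np.random.randint(2, size=(50, 50))
assert np.array_equal(mat_mult_1(A, B), mat_mult_2(A, B))  # same mod-2 product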
Contrary to my expectations, the alternative approach seems to take about 4 times longer than the default for products of larger binary matrices. Why is that? Is there some way I could speed up my alternative approach?
Here's the script I used to test the two methods:
n_vals = np.arange(5, 205, 5)
times = []
for n in n_vals:
    s_1 = f"mat_mult_1(np.random.randint(2, size=({n},{n})), " \
          f"np.random.randint(2, size=({n},{n})))"
    s_2 = f"mat_mult_2(np.random.randint(2, size=({n},{n})), " \
          f"np.random.randint(2, size=({n},{n})))"
    times.append((timeit.timeit(s_1, globals=globals(), number=100),
                  timeit.timeit(s_2, globals=globals(), number=100)))
and here are two plots of the results.
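The plots were produced with something like the following (a sketch; the exact figure styling differed):

times_arr = np.array(times)
plt.loglog(n_vals, times_arr[:, 0], label="mat_mult_1 (default)")
plt.loglog(n_vals, times_arr[:, 1], label="mat_mult_2 (alternative)")
plt.xlabel("n")
plt.ylabel("time for 100 products (s)")
plt.legend()
plt.show()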
Minor updates:
I was able to test these out for larger matrices (up to 1000x1000) and get a better sense of the asymptotics here. It indeed seems to be the case that the "default" algorithm here is O(n^2.7), whereas the alternative is the expected O(n^3) (the observed slopes were 2.703 and 3.133, actually).
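The slopes were read off by fitting a line to the log-log timing data, along these lines (reusing the times list from the script above):

# estimate the scaling exponents as slopes in log-log space
t = np.asarray(times)
log_n = np.log(n_vals)
slope_default, _ = np.polyfit(log_n, np.log(t[:, 0]), 1)
slope_alt, _ = np.polyfit(log_n, np.log(t[:, 1]), 1)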
I also checked how the alternative algorithm compared to the following implementation of "schoolbook" matrix multiplication followed by a mod operation.
def mat_mult_3(A, B):
    # "schoolbook" multiplication via broadcasting, then reduce mod 2
    return np.sum(A[:, :, None] * B[None, :, :], axis=1) % 2
I was very surprised to find that this also does better than the AND/XOR-based method!
In response to Michael's comment, I replaced mat_mult_2 with the following:
def mat_mult_2(A, B):
    # cast to bool up front so & and logical_xor don't trigger
    # repeated implicit conversions on integer inputs
    return np.logical_xor.reduce(A.astype(bool)[:, :, None]
                                 & B.astype(bool)[None, :, :],
                                 axis=1).astype(int)
This arguably still places an undue burden of type conversion on the method, but sticking to multiplication between boolean matrices didn't significantly change performance. The result is that mat_mult_2 now (marginally) outperforms mat_mult_3, as expected.
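To take the conversion cost out of the picture entirely, one can also generate boolean operands up front, e.g.:

# build boolean matrices directly so the XOR/AND method pays no casts at all
A = np.random.randint(2, size=(200, 200)).astype(bool)
B = np.random.randint(2, size=(200, 200)).astype(bool)
C = np.logical_xor.reduce(A[:, :, None] & B[None, :, :], axis=1)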
In response to Harold's comment: another attempt to get the asymptotics of the @ method. My device doesn't seem to be able to handle multiplication with n much greater than 2000.
The observed slope here is 2.93.
It looks like I mostly answered my own question. Here's a summary of what I found.
One way the method I proposed falls short of the numpy method is in its asymptotic complexity. Whereas my method follows the naive, AKA "schoolbook", algorithm for matrix multiplication, numpy pulls its approach from the BLAS routines. My best guess is that numpy uses the SGEMM routine, which, to my limited understanding based on some quick googling and article skimming, seems to be a variant of the Strassen algorithm for matrix multiplication. So, where my method does O(n^3) operations (for a product of two binary n×n matrices), numpy's method does O(n^2.8) (which is roughly borne out by my observations).
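I haven't read the BLAS sources, so as illustration only: the Strassen trick computes the four blocks of the product from seven block products instead of the naive eight, which is where the O(n^log2(7)) ≈ O(n^2.807) exponent comes from. A toy recursive version (assuming n is a power of two; nothing like the blocked, tuned code BLAS actually ships):

import numpy as np

def strassen(A, B, cutoff=64):
    # toy Strassen multiply for square matrices with power-of-two size;
    # below the cutoff, fall back to the ordinary product
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # seven recursive products instead of eight
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty((n, n), dtype=A.dtype)
    C[:m, :m] = M1 + M4 - M5 + M7
    C[:m, m:] = M3 + M5
    C[m:, :m] = M2 + M4
    C[m:, m:] = M1 - M2 + M3 + M6
    return C

# sanity check against numpy on a random 256x256 binary product
A = np.random.randint(2, size=(256, 256))
B = np.random.randint(2, size=(256, 256))
assert np.array_equal(strassen(A, B), A @ B)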
Another way my method falls short is in the repeated implicit type conversions that occur when calling boolean methods on arrays of integers. These can be avoided by passing boolean arrays as the algorithm's input.
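The conversions are visible in the dtypes: the bitwise & of integer arrays stays integer, while the logical ufuncs return booleans, so mixing the two on integer inputs forces casts at every step.

A = np.random.randint(2, size=(3, 3))
print((A & A).dtype)               # an integer dtype (platform default int): bitwise AND preserves it
print(np.logical_xor(A, A).dtype)  # bool: logical ufuncs always return booleans
Ab = A.astype(bool)
print((Ab & Ab).dtype)             # bool throughout once the inputs are boolean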
The upshot, after accounting for these discrepancies, is this: if the schoolbook algorithm is applied but addition and multiplication are replaced by XOR and AND, then (in my trials) the computation time is reduced by about 20%. That isn't nothing, but it's less than I was expecting.
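For concreteness, the ~20% figure came from head-to-head timings along these lines (exact numbers will vary by machine):

n = 200
A = np.random.randint(2, size=(n, n))
B = np.random.randint(2, size=(n, n))
Ab, Bb = A.astype(bool), B.astype(bool)
# schoolbook with +/* and a final mod, vs. schoolbook with XOR/AND
t_school = timeit.timeit(
    lambda: np.sum(A[:, :, None] * B[None, :, :], axis=1) % 2, number=20)
t_xor = timeit.timeit(
    lambda: np.logical_xor.reduce(Ab[:, :, None] & Bb[None, :, :], axis=1), number=20)
print(t_school, t_xor)  # per the trials above, expect the XOR/AND version to be ~20% faster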