Tags: python-3.x, numpy, matrix-multiplication, modular-arithmetic

This (modulo 2) binary matrix multiplication algorithm seems to underperform. What can I do better?


The question has changed since its initial posting as I've chased down a few leads. At this point, I'd say that I'm really looking for the following answers:

  1. Can a significant amount of time be saved by replacing addition/multiplication followed by a modulo 2 operation with `and`/`logical_xor` (assuming that the total number of such operations is kept the same)? If not, then why not? ANSWER: some time can indeed be saved, but it's arguable whether that amount is "significant". (See the equivalence check after this list.)

  2. Where can I read more about the specific approach taken by the BLAS matrix multiplication underlying numpy? Ideally, I'd like a source that doesn't require deciphering the FORTRAN code forged by the sages of the past. ANSWER: The original paper proposing the BLAS matrix multiplication algorithms used today can be found here.
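For concreteness, here's a quick sanity check (my own addition, not part of the original benchmark) that the and/xor formulation agrees with multiply-then-mod on random inputs:

import numpy as np

# Verify: AND followed by XOR-reduction matches the matrix product mod 2.
rng = np.random.default_rng(0)
A = rng.integers(2, size=(50, 50))
B = rng.integers(2, size=(50, 50))

via_mod = (A @ B) % 2
via_xor = np.logical_xor.reduce(A[:, :, None] & B[None, :, :], axis=1).astype(int)
assert np.array_equal(via_mod, via_xor)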

I've left my question in its original form below for posterity.


The following are two algorithms for multiplying binary matrices (i.e. taking the "dot" product) modulo 2. The first ("default") approach just uses numpy matrix-multiplication, then reduces modulo 2. The second ("alternative") approach attempts to speed things up by replacing the addition operation with an xor operation.

import timeit
import numpy as np
import matplotlib.pyplot as plt

def mat_mult_1(A, B):
    # Default: ordinary integer matrix product, reduced mod 2.
    return A @ B % 2

def mat_mult_2(A, B):
    # Alternative: form A[i,k] & B[k,j] for all i, k, j, then XOR-reduce over k.
    return np.logical_xor.reduce(A[:, :, None] & B[None, :, :], axis=1)

Contrary to my expectations, the alternative approach seems to take about 4 times longer than the default for products of larger binary matrices. Why is that? Is there some way I could speed up my alternative approach?

Here's the script I used to test the above two methods:

n_vals = np.arange(5, 205, 5)
times = []

for n in n_vals:
    s_1 = (f"mat_mult_1(np.random.randint(2, size=({n},{n})), "
           f"np.random.randint(2, size=({n},{n})))")
    s_2 = (f"mat_mult_2(np.random.randint(2, size=({n},{n})), "
           f"np.random.randint(2, size=({n},{n})))")

    times.append((timeit.timeit(s_1, globals=globals(), number=100),
                  timeit.timeit(s_2, globals=globals(), number=100)))

and here are two plots of the results.

[Plot: runtimes of mat_mult_1 vs. mat_mult_2 as a function of n]
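The plotting code isn't shown in the post; a minimal sketch that would produce a comparison plot like this one (reusing n_vals and times from the script above) might be:

# Hypothetical plotting code; the original wasn't included.
times_arr = np.array(times)  # shape (len(n_vals), 2)
plt.plot(n_vals, times_arr[:, 0], label="mat_mult_1 (@ then % 2)")
plt.plot(n_vals, times_arr[:, 1], label="mat_mult_2 (& then xor-reduce)")
plt.xlabel("matrix size n")
plt.ylabel("time for 100 products (s)")
plt.legend()
plt.show()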


Minor updates:

I was able to test these out for larger matrices (up to 1000x1000) and get a better sense of the asymptotics here. It indeed seems to be the case that the "default" algorithm here is O(n^2.7), whereas the alternative is the expected O(n^3) (the observed slopes were 2.703 and 3.133, actually).
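The exponents were presumably obtained from a log-log fit; a sketch of that calculation (my reconstruction, since the fitting code isn't in the post):

# Fit log(time) = slope * log(n) + intercept for each method.
times_arr = np.asarray(times)
log_n = np.log(n_vals)
slope_default = np.polyfit(log_n, np.log(times_arr[:, 0]), 1)[0]
slope_alt = np.polyfit(log_n, np.log(times_arr[:, 1]), 1)[0]
print(slope_default, slope_alt)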

[Log-log plot of runtimes vs. n for both methods, up to n = 1000]

I also checked how the alternative algorithm compared to the following implementation of "schoolbook" matrix multiplication followed by a mod operation.

def mat_mult_3(A, B):
    # Schoolbook: elementwise products, summed over the shared axis, then mod 2.
    return np.sum(A[:, :, None] * B[None, :, :], axis=1) % 2

I was very surprised to find that this also does better than the and/xor-based method!

[Plot comparing mat_mult_2 and mat_mult_3 runtimes]

In response to Michael's comment, I replaced mat_mult_2 with the following:

def mat_mult_2(A, B):
    # Cast to bool up front; & and logical_xor then act on boolean arrays.
    return np.logical_xor.reduce(
        A.astype(bool)[:, :, None] & B.astype(bool)[None, :, :],
        axis=1).astype(int)

This arguably still places an undue burden of type conversion on the method, but sticking to multiplication between boolean matrices didn't significantly change performance. The result is that mat_mult_2 now (marginally) outperforms mat_mult_3, as expected.
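For reference, "sticking to multiplication between boolean matrices" means something like the following (my sketch; the exact benchmark code isn't shown):

# Generate boolean inputs once, so no per-call astype conversions are needed.
A = np.random.randint(2, size=(200, 200)).astype(bool)
B = np.random.randint(2, size=(200, 200)).astype(bool)
C = np.logical_xor.reduce(A[:, :, None] & B[None, :, :], axis=1)  # stays bool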

[Plot: the boolean-typed mat_mult_2 now marginally outperforms mat_mult_3]

In response to Harold's comment: another attempt to get the asymptotics of the @ method. My device doesn't seem to be able to handle multiplication with n much greater than 2000.

The observed slope here is 2.93.
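A sketch of the kind of large-n timing loop used here (a reconstruction; the exact sizes and repeat counts aren't stated in the post):

# Time only the multiplication itself, not the random matrix generation.
for n in [500, 1000, 2000]:
    A = np.random.randint(2, size=(n, n))
    B = np.random.randint(2, size=(n, n))
    t = timeit.timeit(lambda: (A @ B) % 2, number=3)
    print(n, t / 3)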

[Log-log plot of @-based runtimes up to n ≈ 2000; fitted slope 2.93]


Solution

It looks like I mostly answered my own question. Here's a summary of what I found.