performance, cython, sparse-matrix

How to exploit sparsity with lists in a Cython function


I have a function I need to implement which does not vectorize well, so I am implementing it in Cython to make the nested for-loops workable. My problem has a similar complexity to naive matrix-matrix multiplication, but at least one of the matrices is binary and often very sparse. The sparse matrix also changes throughout the program, so the simplest representation I could think of is a nested list format (for each row, a list of its non-zero column indices).

I suspect my slowdown is due to the use of Python lists, but since I have to remove from and append to the inner lists, I think this makes other formats (namely CSR) less efficient. I've looked at other questions here regarding sparse matrices in Cython, and none seem to have the remove-and-append behaviour that is necessary for my application.

Here are the sparse and dense functions:

import numpy as np
cimport numpy as np
from libc.math cimport exp

def spgreedy(np.ndarray[double, ndim=2] J, 
             np.ndarray[double, ndim=1] h,
             list S,  # per-row lists of non-zero column indices
             double temp,
             ):

    cdef int n, d 
    n = len(S)
    d = J.shape[0]

    cdef int j, k
    cdef double dot, curr, prob

    for s in S:
        for j in range(d):

            if j in s:
                s.remove(j)

            dot = h[j]
            for k in s:
                dot += J[j, k]

            curr = dot / temp

            if curr < -100:
                prob = 0.0
            elif curr > 100:
                prob = 1.0
            else:
                prob = 1.0 / (1.0 + exp(-curr))

            if np.random.rand() < prob:
                s.append(j)

    return S

def greedy(np.ndarray[double, ndim=2] J, 
           np.ndarray[double, ndim=1] h,
           np.ndarray[int, ndim=2] S,
           double temp,
           ):

    cdef int n, d 
    n = len(S)
    d = J.shape[0]

    cdef int i, j, k
    cdef double dot, curr, prob

    for i in range(n):
        for j in range(d):

            dot = h[j]
            for k in range(d):
                dot += J[j, k] * S[i,k]

            curr = dot / temp

            if curr < -100:
                prob = 0.0
            elif curr > 100:
                prob = 1.0
            else:
                prob = 1.0 / (1.0 + exp(-curr))

            S[i,j] = 1*(np.random.rand() < prob)

    return S

and when I compare the two on some random sparse data

from time import time
import numpy as np
# greedy and spgreedy come from the compiled Cython module

d = 1000
n = 50

J = 1.0*np.random.choice([-1,0,1], size=(d,d), p=[0.05,0.9,0.05])
J = np.triu(J,1) + np.triu(J,1).T
h = -np.ones(d)

S = np.random.choice([0,1], size=(n, d), p=[0.95, 0.05]).astype(np.intc)  # match the C int buffer type
Slist = [np.where(s)[0].tolist() for s in S]

t0 = time()
_ = greedy(J, h, S, 1e-5)
print(time() - t0)

t0 = time()
_ = spgreedy(J, h, Slist, 1e-5)
print(time() - t0)

I find that the dense version is roughly four times faster:

dense: 0.04369068145751953
sparse: 0.18976712226867676

I suspect that the discrepancy comes from the use of Python lists, but I also feel kind of bad cycling through that inner for k in range(d) loop when the vast majority of the entries contribute nothing to the sum! Maybe there's no way to take advantage of this given how dynamic it is, but I thought I'd ask in case there is another type which is better suited.


Solution

  • Several issues could explain this performance gap.

    First of all, remove is expensive because it has to traverse the items of the list, find the target item and then shift all the following ones: it runs in O(len(s)) here (and the j in s membership test has the same cost). The thing is, the order of the items does not seem to matter here, so you can use another data structure with fast insertion/deletion. For example, a hash-set can insert/remove items in (amortised) constant time.

    Even better: you can design your own custom structure specifically for this use-case, often called a sparse set. It uses 2 flat preallocated arrays (the size of s is bounded by d): an index array (the sparse part, mapping an item i to its position in the second array) and a values array (the dense part, containing the items currently stored in s). The index array lets you test in constant time whether an item exists (just check whether index[i] is a valid position), and the values array lets you iterate over the items of s efficiently. To add a new item i, you can first check index[i] (technically not needed here, since the item is known not to be in the set at that point), append i at the end of the values array (tracking the number of stored items in a variable, e.g. lastEntryIndex), and store that position in index[i]. To remove an item i, you look up its position index[i], move the last stored value into that slot, update the index entry of the moved value to its new position, reset index[i] to an invalid value and decrement the item count. All of this is done in constant time, and iterating over the values stays fast (as opposed to some hash-set implementations). Not to mention that flat arrays are generally faster than Python lists (when bound checking and wrap-around are disabled).
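
    Here is a minimal Cython sketch of that index/values layout. The class name SparseSet and its methods are illustrative, not an existing API; it assumes the stored items are in [0, d) and that bound checking and wrap-around are disabled (e.g. with directives at the top of the .pyx file):

    import numpy as np

    cdef class SparseSet:
        cdef public int size    # number of items currently stored
        cdef int[::1] index     # sparse part: index[i] = position of i in values, or -1
        cdef int[::1] values    # dense part: the items currently in the set

        def __cinit__(self, int capacity):
            self.size = 0
            self.index = np.full(capacity, -1, dtype=np.intc)
            self.values = np.empty(capacity, dtype=np.intc)

        cdef bint contains(self, int i):
            return self.index[i] >= 0

        cdef void add(self, int i):
            # assumes i is not already stored (cf. the optional check above)
            self.values[self.size] = i
            self.index[i] = self.size
            self.size += 1

        cdef void remove(self, int i):
            cdef int pos = self.index[i]
            cdef int last
            if pos >= 0:
                last = self.values[self.size - 1]
                self.values[pos] = last    # move the last item into the hole
                self.index[last] = pos     # ...and update its recorded position
                self.index[i] = -1         # mark i as absent
                self.size -= 1

    In spgreedy, the j in s test then becomes a constant-time check of index[j], and the inner sum iterates over values[:size] instead of a Python list.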

    Moreover, the innermost loop of greedy can benefit from the SIMD units of the CPU (depending on how you compile the program -- which is not mentioned in the question), while the one of spgreedy cannot. Indeed, the former reads contiguous arrays, so a C compiler can generate vector instructions that read fixed-size blocks from memory, multiply/add the blocks together, and finally sum the items of the resulting block. On mainstream compilers targeting x86-64 CPUs, the default SIMD width is 128 bits (SSE2), that is 2 double-precision numbers per vector register, and modern x86-64 CPUs typically have 2 such 128-bit SIMD units. Most of them nowadays also have at least 2 256-bit SIMD units (which can be used with the compilation flag -mavx in GCC/Clang). Benefiting from SIMD units in spgreedy is not really possible because of the indirect indexing. Technically, it could be done with SIMD gather instructions on relatively recent x86-64 CPUs, but they are not very efficient on most CPUs anyway (and the instruction set is certainly not enabled by default either). This means the innermost loop of greedy can be executed much more efficiently than the one of spgreedy. There is not much to do about this (though using gather instructions manually may help a bit). This is the price to pay for sparse operations.
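
    For reference, here is one way to pass such flags when building the extension with setuptools (assuming GCC/Clang; the module and file names are placeholders):

    # setup.py -- build the Cython extension with optimizations and AVX enabled
    from setuptools import Extension, setup
    from Cython.Build import cythonize
    import numpy as np

    ext = Extension(
        "greedy_mod",                          # placeholder module name
        sources=["greedy_mod.pyx"],
        include_dirs=[np.get_include()],       # needed because of "cimport numpy"
        extra_compile_args=["-O3", "-mavx"],   # let GCC/Clang use 256-bit SIMD
    )

    setup(ext_modules=cythonize(ext))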

    Finally, regarding the actual assembly code generated by the underlying C compiler, the innermost loop of spgreedy might be bound by the latency of the addition operations (which form a long chain of dependent instructions). This typically happens if the compiler does not unroll the loop at all. Manual unrolling can be used to break the dependency chain and mitigate this latency issue. Here is an example:

    # Accumulate into two independent variables to break the dependency chain
    dot1 = h[j]
    dot2 = 0.0
    i = 0
    s_len = len(s)

    # Compute items 2 by 2
    while i + 1 < s_len:
        dot1 += J[j, s[i]]
        dot2 += J[j, s[i+1]]
        i += 2

    # Possible last item
    if i < s_len:
        dot1 += J[j, s[i]]

    dot = dot1 + dot2
    

    Be careful with the types if you use this code in Cython (e.g. i, s_len, dot1, etc. must be typed for the sake of performance). You could compute items 4 by 4 to possibly get better performance (this is dependent on the target CPU and the compilation flags). You should also disable wrap-around and bound checking to avoid expensive checks in this loop (at the expense of a possibly less safe execution).
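
    For example, a typed version of the unrolled accumulation could look like the sketch below, assuming the non-zero indices are stored in an int[::1] memoryview (such as the values array of the sparse set sketched above); the function name row_dot is illustrative:

    cimport cython

    @cython.boundscheck(False)
    @cython.wraparound(False)
    cdef double row_dot(double[:, ::1] J, double hj, int[::1] idx, int count, int j):
        # Two independent accumulators break the chain of dependent additions
        cdef double dot1 = hj
        cdef double dot2 = 0.0
        cdef int i = 0
        while i + 1 < count:
            dot1 += J[j, idx[i]]
            dot2 += J[j, idx[i + 1]]
            i += 2
        if i < count:              # possible leftover item
            dot1 += J[j, idx[i]]
        return dot1 + dot2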

    Moreover, note that the outermost loop can certainly benefit from multithreading (but be careful about possible race conditions).

    Last but not least, please note that np.random.rand() can take a significant time in Cython (because each call goes through the Python interpreter rather than being a native call) compared to a native random function. That being said, the chosen random function must be guaranteed to be thread-safe so as to avoid race conditions if you multithread the code. rand_r is an example of a thread-safe native function. On my machine (AMD 5945WX CPU), np.random.rand() takes 210 ns/call, while rand_r should be at least 10 times faster.
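
    As an illustration, replacing the per-element np.random.rand() call with rand_r could look like the sketch below. Note that rand_r is POSIX-only (it will not build with MSVC), its statistical quality is far below NumPy's generator, and the helper name native_rand is made up:

    cdef extern from "stdlib.h":
        int rand_r(unsigned int *seedp) nogil
        int RAND_MAX

    cdef unsigned int _seed = 123456789    # use one seed per thread if you multithread

    cdef inline double native_rand(unsigned int *seedp) nogil:
        # uniform double in [0, 1)
        return rand_r(seedp) / (<double>RAND_MAX + 1.0)

    # in the loops, "np.random.rand() < prob" then becomes:
    #     native_rand(&_seed) < prob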