Tags: numpy, cython, numba, blas, memoryview

Maximizing the speed of Cython array operations when using typed memory views and BLAS


I'm trying to maximize the speed of my Cython 3.0 code, which updates an array in a loop of several array operations (including matrix-vector multiplication, vector-vector addition, and scalar-vector multiplication).

I have been using typed memoryviews to pass my array objects in and out of functions, and BLAS (scipy.linalg.cython_blas) to do the actual array operations efficiently. My current working code is faster than a Numba implementation for large array sizes, but slower than Numba for small array sizes (timings in the graph below).

It seems that most of the computing time of my Cython code for small array sizes is spent making copies of typed memoryview arrays. However, I don't know how to do these operations without making copies. My code is below.

Python scratch code:

dim = ... # dimension of vectors/matrices
H = ... # Hermitian matrix, size dim x dim
c = ... # complex vector, size dim
dt = ... # double scalar

def f(H, c, dt):
    q = c.real
    p = c.imag
    p -= 0.5 * dt * H @ q
    q += dt * H @ p
    p -= 0.5 * dt * H @ q
    c = q + 1j*p
    return c

for i in range(100):
    c = f(H, c, dt)

Numpy initialization:

import numpy as np

dim = ... # varied to test speed

H = np.zeros((dim,dim), dtype = np.complex_) # Hermitian matrix
for i in range(dim):
    H[i,i] = np.random.rand()
    for j in range(i+1,dim):
        H[i,j] = np.random.rand() + 1j*np.random.rand()
        H[j,i] = np.conj(H[i,j])

c = np.zeros(dim, dtype = np.complex_) # complex vector
c[0] = 1.0

dt = 0.001

c = main_numba(H, c, dt) # time this

c = main_cython(H, c, dt) # time this
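
For example, each call can be timed with a harness like the following (a sketch; the warm-up call and repeat count are illustrative):

import timeit

main_numba(H, c.copy(), dt) # warm-up so Numba's JIT compilation is not timed
# fresh copies of c are passed because the Cython version updates c_in in place
reps = 10
t_numba = timeit.timeit(lambda: main_numba(H, c.copy(), dt), number=reps) / reps
t_cython = timeit.timeit(lambda: main_cython(H, c.copy(), dt), number=reps) / reps
print(f"dim = {dim}: numba {t_numba:.2e} s, cython {t_cython:.2e} s")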

Numba implementation:

from numba import jit

@jit(nopython=True)
def f_numba(H, c, dt):
    q = np.real(c).astype(np.complex_) # real part of c
    p = np.imag(c).astype(np.complex_) # imag part of c
    p -= 0.5 * dt * H @ q
    q += dt * H @ p
    p -= 0.5 * dt * H @ q
    return q+1j*p

@jit(nopython=True)
def main_numba(H, c, dt):
    for i in range(100):
        c = f_numba(H, c, dt) # update c 100 times
    return c

Cython implementation:

import numpy as np
cimport numpy as cnp
cnp.import_array()
cimport cython 
cimport scipy.linalg.cython_blas as blas

# return real part of memview array
cdef double complex[:] mvR(double complex[:] vec):
    cdef double complex[:] vecR = vec.copy() # expensive copy
    for i in range(vec.shape[0]):
        vecR[i] = vec[i].real
    return vecR

# return imag part of memview array
cdef double complex[:] mvI(double complex[:] vec):
    cdef double complex[:] vecI = vec.copy() # expensive copy
    for i in range(vec.shape[0]):
        vecI[i] = vec[i].imag
    return vecI

# return matrix product of Hermitian matrix on a vector using BLAS
cdef double complex[:] HxV(double complex[::1,:] mat, double complex[:] vec):
    cdef double complex[:] vec_copy = vec.copy() # expensive copy
    cdef char uplo = 'L'
    cdef int dim = mat.shape[0]
    cdef int inc = 1
    cdef double complex alpha = 1.0
    cdef double complex beta = 0.0
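    # zhemv computes vec_copy := alpha*mat@vec + beta*vec_copy, reading mat's lower ('L') triangle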
    blas.zhemv(&uplo, &dim, &alpha, &mat[0,0], &dim, &vec[0], &inc, &beta, &vec_copy[0], &inc)
    return vec_copy

# return vector sum of two vectors using BLAS
cdef double complex[:] VpV(double complex[:] vec1, double complex[:] vec2):
    cdef double complex[:] vec2_copy = vec2.copy() # expensive copy
    cdef int dim = vec1.shape[0]
    cdef int inc = 1
    cdef double complex alpha = 1.0
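    # zaxpy computes vec2_copy := alpha*vec1 + vec2_copy in place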
    blas.zaxpy(&dim, &alpha, &vec1[0], &inc, &vec2_copy[0], &inc)
    return vec2_copy

# return product of scalar and a vector using BLAS
cdef double complex[:] SxV(double complex sca, double complex[:] vec):
    cdef double complex[:] vec_copy = vec.copy() # expensive copy
    cdef int dim = vec.shape[0]
    cdef int inc = 1
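    # zscal scales vec_copy := sca*vec_copy in place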
    blas.zscal(&dim, &sca, &vec_copy[0], &inc)
    return vec_copy

# do matrix operations
cdef double complex[:] f_cython(double complex[::1,:] H, double complex[:] c_in, double complex dt):
    cdef double complex[:] c_out = c_in.copy() # expensive copy
    cdef double complex[:] q = mvR(c_in) # real part of c_in
    cdef double complex[:] p = mvI(c_in) # imag part of c_in
    p[:] = VpV(SxV(-0.5*dt, HxV(H, q)), p)
    q[:] = VpV(SxV(dt, HxV(H, p)), q)
    p[:] = VpV(SxV(-0.5*dt, HxV(H, q)), p)
    c_out[:] = VpV(q, SxV(1.0j, p)) # turn q and p back into c
    return c_out

# create memview arrays from numpy arrays, update c 100 times, return as numpy array
cpdef cnp.ndarray[cnp.complex_t, ndim=1] main_cython(cnp.ndarray[cnp.complex_t, ndim=2] H_in, cnp.ndarray[cnp.complex_t, ndim=1] c_in, double complex dt):
    cdef double complex[::1,:] H = np.asfortranarray(H_in)
    cdef double complex[:] c = c_in
    for i in range(100):
        c[:] = f_cython(H, c, dt) # update c 100 times
    return np.asarray(c)
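
For completeness, the module can be compiled with a standard setup script along these lines (a sketch; the filename cymodule.pyx is hypothetical):

# setup.py (sketch)
from setuptools import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("cymodule.pyx", language_level=3),
    include_dirs=[np.get_include()], # needed for cimport numpy
)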

I timed these implementations for array dimensions from 2 to 512, as shown below.

[Figure: timings of the implementations for different array sizes dim]

The NumPy implementation is similar to the Numba implementation, just without jit. Cython (no copy) is the Cython implementation with the array copying commented out (the lines marked # expensive copy); as a result, Cython (no copy) gives wrong answers.

As the graph shows, Cython is 6 times slower than Numba for dim = 2 but 4 times faster for dim = 512. More importantly, when the copying is removed, Cython's small-dim speed is equal to or better than Numba's. This shows that most of Cython's overhead for small array sizes comes from the expensive copying.

My issue with trying to remove the copying from Cython is that the array operations in f_cython need to return their results without changing the values of the input arrays. I don't know how to do this without copying one of the arrays so that it can be overwritten and returned (while the original input values remain available, which VpV and SxV require). Yet Numba appears to do its operations without this copying overhead, as evidenced by Numba's speed matching Cython (no copy) for small array sizes.

My questions are:

  1. How can I reduce the overhead cost of my Cython implementation to be more like Cython (no copy)?

  2. Are there any other ways to significantly increase the speed of my Cython array operations (without rewriting everything in pure C/C++)?

UPDATE

I've updated my Cython implementation, building off of DavidW's answer by using in-place operations to reduce the amount of copying. I've managed to reduce the number of copies from 14 to only 2.

Updated Cython implementation:

cdef void mvR2(double complex[:] vec):
    for i in range(vec.shape[0]):
        vec[i] = vec[i].real
    return

cdef void mvI2(double complex[:] vec):
    for i in range(vec.shape[0]):
        vec[i] = vec[i].imag
    return

cdef double complex[:] HxV2(double complex[::1,:] mat, double complex[:] vec, double complex[:] tmp):
    cdef char uplo = 'L'
    cdef int dim = mat.shape[0]
    cdef int inc = 1
    cdef double complex alpha = 1.0
    cdef double complex beta = 0.0
    blas.zhemv(&uplo, &dim, &alpha, &mat[0,0], &dim, &vec[0], &inc, &beta, &tmp[0], &inc)
    return tmp

cdef double complex[:] VpV2(double complex[:] vec1, double complex[:] vec2):
    cdef int dim = vec1.shape[0]
    cdef int inc = 1
    cdef double complex alpha = 1.0
    blas.zaxpy(&dim, &alpha, &vec1[0], &inc, &vec2[0], &inc)
    return vec2

cdef double complex[:] SxV2(double complex sca, double complex[:] vec):
    cdef int dim = vec.shape[0]
    cdef int inc = 1
    blas.zscal(&dim, &sca, &vec[0], &inc)
    return vec

cdef double complex[:] f_cython2(double complex[::1,:] H, double complex[:] c, double complex dt):
    cdef double complex[:] q = c # view of c (no copy), to be real'd in place
    cdef double complex[:] p = c.copy() # to be imag'd, expensive copy
    mvR2(q) # convert to real part
    mvI2(p) # convert to imag part
    cdef double complex[:] tmp = q.copy() # temp array, expensive copy (only shape allocation matters)
    VpV2(SxV2(-0.5*dt, HxV2(H, q, tmp)), p)
    VpV2(SxV2(dt, HxV2(H, p, tmp)), q)
    VpV2(SxV2(-0.5*dt, HxV2(H, q, tmp)), p)
    VpV2(SxV2(1.0j, p), q)
    return c

cpdef cnp.ndarray[cnp.complex_t, ndim=1] main_cython2(cnp.ndarray[cnp.complex_t, ndim=2] H_in, cnp.ndarray[cnp.complex_t, ndim=1] c_in, double complex dt):
    cdef double complex[::1,:] H = np.asfortranarray(H_in)
    cdef double complex[:] c = c_in
    for i in range(100):
        f_cython2(H, c, dt) # update c 100 times
    return np.asarray(c)

Here is the timing graph including the update:

[Figure: updated timings]

The updated Cython implementation is now nearly as fast as Cython (no copy) and Numba for small arrays.

The 2 remaining copies are the only obvious overhead left, and the copy that creates the tmp array is not strictly necessary, since tmp only needs a correctly-shaped allocation of memory. However, allocating tmp with np.empty_like, np.empty, or even malloc is, for some reason I can't explain, slower than using .copy().

UPDATE 2

I've updated my Cython implementation again, this time by recognizing that I don't need to keep deallocating my arrays at the end of each loop iteration. This means I only have to perform 2 copies before the loop begins and none afterwards, going from 200 total copies in my previous update to just 2.

Updated Cython implementation (2nd update):

cdef void f_cython3(double complex[::1,:] H, double complex[:] q, double complex[:] p, double complex[:] tmp, double complex dt):
    p[:] = q[:] # set p values to be complex c as well
    mvR2(q) # convert to real part
    mvI2(p) # convert to imag part
    VpV2(SxV2(-0.5*dt, HxV2(H, q, tmp)), p)
    VpV2(SxV2(dt, HxV2(H, p, tmp)), q)
    VpV2(SxV2(-0.5*dt, HxV2(H, q, tmp)), p)
    VpV2(SxV2(1.0j, p), q)
    return

cpdef cnp.ndarray[cnp.complex_t, ndim=1] main_cython3(cnp.ndarray[cnp.complex_t, ndim=2] H_in, cnp.ndarray[cnp.complex_t, ndim=1] c_in, double complex dt):
    cdef double complex[::1,:] H = np.asfortranarray(H_in)
    cdef double complex[:] q = c_in # q is real'd at start of each loop cycle
    cdef double complex[:] p = c_in.copy() # p is imag'd at start of each loop cycle
    cdef double complex[:] tmp = c_in.copy() # keep this allocation - no need to deallocate after each loop cycle
    for i in range(100):
        f_cython3(H, q, p, tmp, dt) # update q and p which is actually updating c
    return np.asarray(q) # q after the loop is actually complex c

The timing graph has been updated again:

[Figure: timings including the second update]

The copying overhead is essentially gone even for small sizes, and this new update is faster than every other approach listed. In particular, this new Cython implementation is faster than Numba at every array size, and for small arrays it is 8 times faster than the original Cython code.

I guess the moral of the story is: don't deallocate/reallocate when you don't need to.


Solution

  • I don't intend to give a full answer here because there's a lot of stuff to go through, but:

    real and imag don't need to make a copy!

    np.real(arr)
    

    What this actually returns is an array that shares data with arr but only looks at the real part. This lets you get rid of the "expensive copy". You probably could do this with Cython memoryviews and shifting pointer offsets, but frankly it's a lot easier just to call np.real. This will be better than your Cython version because it won't make a copy.

    (You may struggle with this if you want to convert these into complex arrays though, since this will require making a copy somewhere. But even so, you seem to copy twice instead of once)
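
    A quick NumPy check (a sketch, nothing Cython-specific) confirms the sharing:

        import numpy as np

        c = np.array([1 + 2j, 3 + 4j])
        q = np.real(c)                 # real-valued view onto c's buffer
        print(np.shares_memory(q, c))  # True: no copy was made
        q[0] = 9.0                     # writing through the view...
        print(c)                       # [9.+2.j 3.+4.j] ...also modifies c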

    Don't copy for your output arrays

    A lot of the time you do cdef double complex[:] vec_copy = vec.copy() # expensive copy and never use the copied data; you just want a buffer of equivalent size to write into. Maybe try cdef double complex[:] vec_copy = np.empty_like(vec)? This skips the cost of actually doing the copy.
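
    For instance, HxV in the question only uses vec_copy as an output buffer (beta is zero, so zhemv overwrites it completely), which means the copy can be replaced by a bare allocation. A sketch, reusing the question's module-level imports:

        cdef double complex[:] HxV(double complex[::1,:] mat, double complex[:] vec):
            cdef double complex[:] out = np.empty_like(vec)  # shaped allocation, no copy
            cdef char uplo = 'L'
            cdef int dim = mat.shape[0]
            cdef int inc = 1
            cdef double complex alpha = 1.0
            cdef double complex beta = 0.0  # beta = 0: out is write-only for zhemv
            blas.zhemv(&uplo, &dim, &alpha, &mat[0,0], &dim, &vec[0], &inc, &beta, &out[0], &inc)
            return out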

    Sometimes you don't need to copy at all

        cdef double complex[:] c_out = c_in.copy() # expensive copy
        # ...
        c_out[:] = VpV(q, SxV(1.0j, p)) # turn q and p back into c
    

    Here you make a copy of some data in c_in, then return a memoryview from VpV and copy it element-by-element into c_out. This wastes time twice: first, you make a copy of c_in that you never use; second, VpV already returns a perfectly viable memoryview, so you don't need to allocate memory and copy into it.

        cdef double complex[:] c_out
        c_out = VpV(q, SxV(1.0j, p)) # turn q and p back into c
    

    This just redirects c_out to point to the memoryview that was already created in VpV and is really cheap because it's just a small amount of reference counting.

    A lot of this stuff can be done in place

    For example the line:

    p[:] = VpV(SxV(-0.5*dt, HxV(H, q)), p)
    

    can be split into (pseudocode):

    tmp1 = HxV(H, q)
    SxVInplaceModifyingTmp1(-0.5*dt, tmp1)
    VpVInplaceModifyingP(tmp1, p)  # since you immediately re-assign to `p`
    

    Each of these in-place operations shouldn't need any copies.