I'm trying to maximize the speed of my Cython 3.0 code, which repeatedly updates an array in a loop of several array operations (including matrix-vector multiplication, vector-vector addition, and scalar-vector multiplication).
I have been using typed memoryviews to pass my array objects in/out of functions and have been using BLAS (scipy.linalg.cython_blas) to efficiently do the actual array operations. My current working code is faster than a Numba implementation for large array sizes, but is slower than Numba for small array sizes (speeds in graph below).
It seems that most of the computing time of my Cython code for small array sizes is spent making copies of typed memoryview arrays. However, I don't know how to do these operations without making copies. My code is below.
Python scratch code:
dim = ... # dimension of vectors/matrices
H = ... # Hermitian matrix, size dim x dim
c = ... # complex vector, size dim
dt = ... # double scalar
def f(H, c, dt):
    q = c.real
    p = c.imag
    p -= 0.5 * dt * H @ q
    q += dt * H @ p
    p -= 0.5 * dt * H @ q
    c = q + 1j*p
    return c

for i in range(100):
    c = f(H, c, dt)
Numpy initialization:
import numpy as np
dim = ... # varied to test speed
H = np.zeros((dim,dim), dtype = np.complex_) # Hermitian matrix
for i in range(dim):
    H[i,i] = np.random.rand()
    for j in range(i+1, dim):
        H[i,j] = np.random.rand() + 1j*np.random.rand()
        H[j,i] = np.conj(H[i,j])
c = np.zeros(dim, dtype = np.complex_) # complex vector
c[0] = 1.0
dt = 0.001
c = main_numba(H, c, dt) # time this
c = main_cython(H, c, dt) # time this
Numba implementation:
from numba import jit
@jit(nopython=True)
def f_numba(H, c, dt):
    q = np.real(c).astype(np.complex_) # real part of c
    p = np.imag(c).astype(np.complex_) # imag part of c
    p -= 0.5 * dt * H @ q
    q += dt * H @ p
    p -= 0.5 * dt * H @ q
    return q + 1j*p

@jit(nopython=True)
def main_numba(H, c, dt):
    for i in range(100):
        c = f_numba(H, c, dt) # update c 100 times
    return c
Cython implementation:
import numpy as np
cimport numpy as cnp
cnp.import_array()
cimport cython
cimport scipy.linalg.cython_blas as blas
# return real part of memview array
cdef double complex[:] mvR(double complex[:] vec):
    cdef double complex[:] vecR = vec.copy() # expensive copy
    for i in range(vec.shape[0]):
        vecR[i] = vec[i].real
    return vecR
# return imag part of memview array
cdef double complex[:] mvI(double complex[:] vec):
    cdef double complex[:] vecI = vec.copy() # expensive copy
    for i in range(vec.shape[0]):
        vecI[i] = vec[i].imag
    return vecI
# return matrix product of Hermitian matrix on a vector using BLAS
cdef double complex[:] HxV(double complex[::1,:] mat, double complex[:] vec):
    cdef double complex[:] vec_copy = vec.copy() # expensive copy
    cdef char uplo = 'L'
    cdef int dim = mat.shape[0]
    cdef int inc = 1
    cdef double complex alpha = 1.0
    cdef double complex beta = 0.0
    blas.zhemv(&uplo, &dim, &alpha, &mat[0,0], &dim, &vec[0], &inc, &beta, &vec_copy[0], &inc)
    return vec_copy
# return vector sum of two vectors using BLAS
cdef double complex[:] VpV(double complex[:] vec1, double complex[:] vec2):
    cdef double complex[:] vec2_copy = vec2.copy() # expensive copy
    cdef int dim = vec1.shape[0]
    cdef int inc = 1
    cdef double complex alpha = 1.0
    blas.zaxpy(&dim, &alpha, &vec1[0], &inc, &vec2_copy[0], &inc)
    return vec2_copy
# return product of scalar and a vector using BLAS
cdef double complex[:] SxV(double complex sca, double complex[:] vec):
    cdef double complex[:] vec_copy = vec.copy() # expensive copy
    cdef int dim = vec.shape[0]
    cdef int inc = 1
    blas.zscal(&dim, &sca, &vec_copy[0], &inc)
    return vec_copy
# do matrix operations
cdef double complex[:] f_cython(double complex[::1,:] H, double complex[:] c_in, double complex dt):
    cdef double complex[:] c_out = c_in.copy() # expensive copy
    cdef double complex[:] q = mvR(c_in) # real part of c_in
    cdef double complex[:] p = mvI(c_in) # imag part of c_in
    p[:] = VpV(SxV(-0.5*dt, HxV(H, q)), p)
    q[:] = VpV(SxV(dt, HxV(H, p)), q)
    p[:] = VpV(SxV(-0.5*dt, HxV(H, q)), p)
    c_out[:] = VpV(q, SxV(1.0j, p)) # turn q and p back into c
    return c_out
# create memview arrays from numpy arrays, update c 100 times, return as numpy array
cpdef cnp.ndarray[cnp.complex_t, ndim=1] main_cython(cnp.ndarray[cnp.complex_t, ndim=2] H_in, cnp.ndarray[cnp.complex_t, ndim=1] c_in, double complex dt):
    cdef double complex[::1,:] H = np.asfortranarray(H_in)
    cdef double complex[:] c = c_in
    for i in range(100):
        c[:] = f_cython(H, c, dt) # update c 100 times
    return np.asarray(c)
I timed these implementations for array dimensions from 2 to 512, as shown in the graph below. The Numpy implementation is the same as the Numba implementation, just without jit. The Cython (no copy) implementation is the Cython implementation with the array copying commented out (the lines marked # expensive copy); as a result, Cython (no copy) gives wrong answers.

As the graph shows, Cython is 6 times slower than Numba for dim = 2 but 4 times faster for dim = 512. More importantly, when the copying is removed from Cython, its small-dim speed is equal to or better than Numba's. This shows that most of the Cython overhead for small array sizes is spent doing the expensive copying.
My issue with trying to remove the copying from Cython is that my array operations in f_cython need to return the result of the operation without changing the values of the input arrays. I don't know how to do this without creating a copy of one of the arrays so that it can be overwritten and returned (while still providing the input values required by VpV and SxV). Yet Numba appears to do its operations without this copying overhead, as evidenced by Numba's speed matching Cython (no copy) for small array sizes.
My questions are:

1. How can I reduce the overhead cost of my Cython implementation to be more like Cython (no copy)?
2. Are there any other ways to significantly increase the speed of my Cython array operations (without rewriting everything in pure C/C++)?
I've updated my Cython implementation, building off of DavidW's answer, which suggests using in-place operations to reduce how often copies are made. I've managed to reduce the number of copies from 14 to only 2 per call.
Updated Cython implementation:
cdef void mvR2(double complex[:] vec):
    for i in range(vec.shape[0]):
        vec[i] = vec[i].real
    return

cdef void mvI2(double complex[:] vec):
    for i in range(vec.shape[0]):
        vec[i] = vec[i].imag
    return
cdef double complex[:] HxV2(double complex[::1,:] mat, double complex[:] vec, double complex[:] tmp):
    cdef char uplo = 'L'
    cdef int dim = mat.shape[0]
    cdef int inc = 1
    cdef double complex alpha = 1.0
    cdef double complex beta = 0.0
    blas.zhemv(&uplo, &dim, &alpha, &mat[0,0], &dim, &vec[0], &inc, &beta, &tmp[0], &inc)
    return tmp
cdef double complex[:] VpV2(double complex[:] vec1, double complex[:] vec2):
    cdef int dim = vec1.shape[0]
    cdef int inc = 1
    cdef double complex alpha = 1.0
    blas.zaxpy(&dim, &alpha, &vec1[0], &inc, &vec2[0], &inc)
    return vec2
cdef double complex[:] SxV2(double complex sca, double complex[:] vec):
    cdef int dim = vec.shape[0]
    cdef int inc = 1
    blas.zscal(&dim, &sca, &vec[0], &inc)
    return vec
cdef double complex[:] f_cython2(double complex[::1,:] H, double complex[:] c, double complex dt):
    cdef double complex[:] q = c # to be real'd
    cdef double complex[:] p = c.copy() # to be imag'd, expensive copy
    mvR2(q) # convert to real part
    mvI2(p) # convert to imag part
    cdef double complex[:] tmp = q.copy() # temp array, expensive copy (only shape allocation matters)
    VpV2(SxV2(-0.5*dt, HxV2(H, q, tmp)), p)
    VpV2(SxV2(dt, HxV2(H, p, tmp)), q)
    VpV2(SxV2(-0.5*dt, HxV2(H, q, tmp)), p)
    VpV2(SxV2(1.0j, p), q)
    return c
cpdef cnp.ndarray[cnp.complex_t, ndim=1] main_cython2(cnp.ndarray[cnp.complex_t, ndim=2] H_in, cnp.ndarray[cnp.complex_t, ndim=1] c_in, double complex dt):
    cdef double complex[::1,:] H = np.asfortranarray(H_in)
    cdef double complex[:] c = c_in
    for i in range(100):
        f_cython2(H, c, dt) # update c 100 times
    return np.asarray(c)
Here is the timing graph including the update:
The updated Cython implementation is now nearly as fast as Cython (no copy) and Numba for small arrays.
The 2 remaining copies are the only obvious overhead left, and the copy used to make the tmp array is not strictly necessary, since tmp only needs a correctly shaped allocation of memory. However, np.empty_like, np.empty, and even malloc are all slower than using .copy() to allocate tmp, for some unknown reason.
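For reference, the malloc variant I tried looks roughly like the sketch below (illustrative only; make_tmp is a made-up helper name, and the caller has to free the buffer manually):

from libc.stdlib cimport malloc, free

# allocate an uninitialized scratch buffer of length dim and expose it as a memoryview
cdef double complex[:] make_tmp(int dim):
    cdef double complex* buf = <double complex*> malloc(dim * sizeof(double complex))
    if buf == NULL:
        raise MemoryError()
    return <double complex[:dim]> buf # no copy; caller must later call free(&tmp[0])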
I've updated my Cython implementation again, this time by recognizing that I don't need to keep deallocating my arrays at the end of each loop iteration. This means I only have to perform 2 copies before the loop begins and none afterwards, effectively going from 200 total copies in my previous update to just 2.
Updated Cython implementation (2nd update):
cdef void f_cython3(double complex[::1,:] H, double complex[:] q, double complex[:] p, double complex[:] tmp, double complex dt):
    p[:] = q[:] # set p values to be complex c as well
    mvR2(q) # convert to real part
    mvI2(p) # convert to imag part
    VpV2(SxV2(-0.5*dt, HxV2(H, q, tmp)), p)
    VpV2(SxV2(dt, HxV2(H, p, tmp)), q)
    VpV2(SxV2(-0.5*dt, HxV2(H, q, tmp)), p)
    VpV2(SxV2(1.0j, p), q)
    return
cpdef cnp.ndarray[cnp.complex_t, ndim=1] main_cython3(cnp.ndarray[cnp.complex_t, ndim=2] H_in, cnp.ndarray[cnp.complex_t, ndim=1] c_in, double complex dt):
    cdef double complex[::1,:] H = np.asfortranarray(H_in)
    cdef double complex[:] q = c_in # q is real'd at start of each loop cycle
    cdef double complex[:] p = c_in.copy() # p is imag'd at start of each loop cycle
    cdef double complex[:] tmp = c_in.copy() # keep this allocation - no need to deallocate after each loop cycle
    for i in range(100):
        f_cython3(H, q, p, tmp, dt) # update q and p which is actually updating c
    return np.asarray(q) # q after the loop is actually complex c
The timing graph has been updated again:
The copying overhead is essentially gone even for small sizes, and this new update is faster than every other approach listed. In particular, this new Cython implementation is comprehensively faster than Numba for all array sizes. The new Cython code is 8 times faster for small arrays than the original Cython code.
I guess the moral of the story is, don't deallocate/reallocate when you don't need to.
I don't intend to give a full answer here because there's a lot of stuff to go through, but:
real and imag don't need to make a copy!

np.real(arr)

What this actually returns is an array that shares data with arr but only looks at the real part. This lets you get rid of the "expensive copy". You probably could do this with Cython memoryviews and shifting pointer offsets, but frankly it's a lot easier just to call np.real. This will be better than your Cython version because it won't make a copy.

(You may struggle with this if you want to convert these into complex arrays though, since that will require making a copy somewhere. But even so, you seem to copy twice instead of once.)
You don't need copy for your output arrays

A lot of the time when you do cdef double complex[:] vec_copy = vec.copy() you never use the copied data - you just want something of equivalent size to write into. Maybe try cdef double complex[:] vec_copy = np.empty_like(vec)? This skips the cost of actually doing the copy.
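For example, HxV could be rewritten along these lines (a sketch, not tested; HxV_empty is a made-up name, and it reuses the np and blas imports from the question):

# same as HxV, but the output buffer is allocated empty instead of copied
cdef double complex[:] HxV_empty(double complex[::1,:] mat, double complex[:] vec):
    cdef double complex[:] out = np.empty_like(np.asarray(vec)) # allocate, don't copy
    cdef char uplo = 'L'
    cdef int dim = mat.shape[0]
    cdef int inc = 1
    cdef double complex alpha = 1.0
    cdef double complex beta = 0.0
    blas.zhemv(&uplo, &dim, &alpha, &mat[0,0], &dim, &vec[0], &inc, &beta, &out[0], &inc)
    return out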
cdef double complex[:] c_out = c_in.copy() # expensive copy
# ...
c_out[:] = VpV(q, SxV(1.0j, p)) # turn q and p back into c
Here you make a copy of some data in c_in, then return a memoryview from VpV and copy that element-by-element into c_out. This is a waste of time (twice): first, you make a copy of c_in that you never use; second, VpV returns a perfectly viable memoryview, so you don't need to allocate memory and copy into it. Instead you can write:
cdef double complex[:] c_out
c_out = VpV(q, SxV(1.0j, p)) # turn q and p back into c
This just redirects c_out to point to the memoryview that was already created in VpV, and it's really cheap because it's just a small amount of reference counting.
Finally, do the operations in place where you can. For example, the line:

p[:] = VpV(SxV(-0.5*dt, HxV(H, q)), p)

can be split into (pseudocode):
tmp1 = HxV(H, q)
SxVInplaceModifyingTmp1(-0.5*dt, tmp1)
VpVInplaceModifyingP(tmp1, p) # since you immediately re-assign to `p`
Each of these in-place operations shouldn't need to make any copies.