I used re to group a long string and put the obtained substrings into a list, based on conditions specified by a dict. I found a significant speed improvement using Cython compared to Python, but would expect a further improvement. A simple piece of code in temp_test.pyx is:
cimport cython
import re

@cython.wraparound(False)
@cython.boundscheck(False)
def cleave(str long_str, str rules, dict term_char):
    """
    Cleaves a long string into substrings.
    """
    cdef:
        object compiler = re.compile(rules)
        Py_ssize_t i, ns
        Py_ssize_t nk = 0
        Py_ssize_t nq = <ssize_t> len(long_str)
        int p, t
        list split_str = compiler.split(long_str)
        list substrs = nq * [None]
        str s

    ns = <ssize_t> len(split_str)
    for i in range(ns):
        if split_str[i] is not None:
            s = split_str[i]
            p = len(s)
            if p == 1 and s in term_char:
                t = term_char[s]
                if t == 0:
                    substrs[nk] = split_str[i - 1]
                else:
                    substrs[nk] = split_str[i] + split_str[i + 1]
                nk += 1

    return substrs[:nk]
with the test code:
from temp_test import cleave
from typing import Dict

def test_cleave():
    long_str: str = r"ABCEDFRFR"
    rules: str = r"([BD])"
    term_char: Dict[str, int] = {"B": 0, "D": 1}
    subs = cleave(long_str, rules, term_char)
    print(subs)
After using cython -a to annotate temp_test.pyx, I found that nearly all lines were highlighted. Since cleave will be called a large number of times, and the function does lots of assignments of strings from one list (i.e., split_str) to another (i.e., substrs), checks like the following always show up:
if (unlikely(__pyx_v_split_str == Py_None)) {
    PyErr_SetString(PyExc_TypeError, "object of type 'NoneType' has no len()");
    __PYX_ERR(0, 24, __pyx_L1_error)
}
__pyx_t_5 = __Pyx_PyList_GET_SIZE(__pyx_v_split_str); if (unlikely(__pyx_t_5 == ((Py_ssize_t)-1))) __PYX_ERR(0, 24, __pyx_L1_error)
__pyx_v_ns = ((Py_ssize_t)__pyx_t_5);
The @cython.wraparound(False) line was highlighted too, with lots of checks (too long to show here). Is this because I check whether the key is in the dict, i.e., s in term_char? Is there a better alternative? I very often use memoryview, malloc, calloc, etc. for numeric operations and calculations, but I am not familiar with string operations in Cython, so any suggestions are very much appreciated.
There are a few small things I can see which are likely slowing it down:
if p == 1 and s in term_char:
    t = term_char[s]
There you're effectively looking up s in term_char twice: once for the in test and once for term_char[s]. Maybe:
if p == 1:
    t = term_char.get(s, None)
    if t is None:
        continue
or something like that.
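In context, the inner part of the loop might then look something like this (just a sketch; it assumes t is left untyped, as the next point suggests, so that None is a valid value for it):
if split_str[i] is not None:
    s = split_str[i]
    p = len(s)
    if p == 1:
        t = term_char.get(s)  # one dict lookup instead of two
        if t is None:
            continue
        if t == 0:
            substrs[nk] = split_str[i - 1]
        else:
            substrs[nk] = split_str[i] + split_str[i + 1]
        nk += 1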
int p, t
Typing t is probably a pessimization. Essentially the only thing you do with t is to get it out of a dict and compare it with 0.
If you just left t untyped, then getting it out of the dict is free (a tiny bit of ref-counting), and comparing it with 0 is almost certainly pretty cheap (small Python ints are cached, so every 0 is usually the same object).
Typing it as int means you get a Python object out of the dict, then do a type check and extract the integer value. The comparison with 0 might be slightly quicker, but overall it's likely slower.
(Typing p is probably fine.)
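As a rough illustration of the difference, compare these two (hypothetical) helpers:
def lookup_typed(dict term_char, str s):
    cdef int t
    # the dict hands back a Python object; Cython then type-checks it
    # and unboxes it into a C int before the comparison
    t = term_char[s]
    return t == 0

def lookup_untyped(dict term_char, str s):
    # the object is used straight out of the dict (just ref-counting),
    # and `t == 0` compares against the cached small int 0
    t = term_char[s]
    return t == 0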
In contrast, what might be worthwhile is to rewrite
substrs[nk] = split_str[i] + split_str[i + 1]
as
substrs[nk] = s + <str>(split_str[i + 1])
(i.e. let Cython know that both operands are strings). This lets it take a slightly accelerated string-concatenation route rather than generic addition.
Checks like
if (unlikely(__pyx_v_split_str == Py_None)) {
can be eliminated with @cython.nonecheck(False). Obviously you need to satisfy yourself that that's appropriate, i.e. that split_str genuinely can't be None at those points.
I suspect all of the above will be marginal improvements though, and you shouldn't expect any big gains from this kind of micro-optimization.
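For completeness, putting the pieces above together might give something like this (a sketch only; I haven't benchmarked it):
cimport cython
import re

@cython.nonecheck(False)
@cython.wraparound(False)
@cython.boundscheck(False)
def cleave(str long_str, str rules, dict term_char):
    """
    Cleaves a long string into substrings.
    """
    cdef:
        Py_ssize_t i, ns
        Py_ssize_t nk = 0
        int p  # note: t is deliberately left untyped
        list split_str = re.compile(rules).split(long_str)
        list substrs = len(long_str) * [None]
        str s

    ns = len(split_str)
    for i in range(ns):
        if split_str[i] is not None:
            s = split_str[i]
            p = len(s)
            if p == 1:
                t = term_char.get(s)  # single dict lookup
                if t is None:
                    continue
                if t == 0:
                    substrs[nk] = split_str[i - 1]
                else:
                    substrs[nk] = s + <str>(split_str[i + 1])
                nk += 1

    return substrs[:nk]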