I have some of these constructions in my code:
from pyparsing import Char, srange
_non_ascii = Char(srange('[\x80-\U0010FFFF]'))
The generation of the ranges is extremely slow, taking 6-8 seconds (huh?) even with Python 3.12 on a relatively decent machine.
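For reference, a minimal way to reproduce the timing (exact numbers will of course vary by machine):

import time
from pyparsing import Char, srange

start = time.perf_counter()
_non_ascii = Char(srange('[\x80-\U0010FFFF]'))
print(f"{time.perf_counter() - start:.1f}s")  # 6-8 seconds for me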
Why is this happening and what should I replace those with?
This is what the srange() function looks like (taken from GitHub):
_charRange = Group(_singleChar + Suppress("-") + _singleChar)
_reBracketExpr = (
    Literal("[")
    + Opt("^").set_results_name("negate")
    + Group(OneOrMore(_charRange | _singleChar)).set_results_name("body")
    + Literal("]")
)

def srange(s: str) -> str:
    _expanded = (
        lambda p: p
        if not isinstance(p, ParseResults)
        else "".join(chr(c) for c in range(ord(p[0]), ord(p[1]) + 1))
    )
    try:
        return "".join(_expanded(part) for part in _reBracketExpr.parse_string(s).body)
    except Exception as e:
        return ""
As you (or I) can see, pyparsing parses the argument using _reBracketExpr, which is itself a pyparsing expression, then passes each part of the result to _expanded. _expanded creates a new string if p is a ParseResults (a range), or returns p unchanged otherwise (a single character). In the end, all those strings are joined, creating an even bigger string.
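To get a feel for the scale, here is a quick check of what that join produces for the pattern above (the length follows directly from the range bounds):

from pyparsing import srange

expanded = srange('[\x80-\U0010FFFF]')
print(len(expanded))  # 1113984, i.e. 0x10FFFF - 0x80 + 1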
In your (or my) case, each Char(srange(...)) creates a 0x10FFFF - 0x80 + 1 = ~1.1M character string. Fortunately, since there is only one element, that string is reused rather than joined to create the eventual result. But that's not all of it; not yet. Word, Char's superclass, which gets passed that string, also does work of its own:
class Word(Token):
    def __init__(self, init_chars: str = "", ..., *, initChars: str = "", ...):
        initChars = initChars or init_chars
        ...
        # One full iteration
        initChars_set = set(initChars)
        ...
        self.initChars = initChars_set
        # O(n log n) sorting, then another join
        self.initCharsOrig = "".join(sorted(initChars_set))
        ...
        # __or__() creates a new set, which requires another full iteration.
        if " " not in (self.initChars | self.bodyChars):
            if len(self.initChars) == 1:
                ...
            else:
                # New string, again.
                # This function is slow too; explained below.
                re_leading_fragment = f"[{_collapse_string_to_ranges(self.initChars)}]"
            if self.bodyChars == self.initChars:
                ...
            else:
                ...
                # Yet another new string
                self.reString = f"{re_leading_fragment}{re_body_fragment}{repeat}"
            ...
            self.re = re.compile(self.reString)
        ...
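Strung together, that is several independent passes over the same ~1.1M characters. A rough sketch of just the set/sort/union steps, done outside pyparsing, shows where the time goes:

s = "".join(chr(c) for c in range(0x80, 0x10FFFF + 1))  # the ~1.1M-char string

chars = set(s)                     # full pass #1: hash every character
ordered = "".join(sorted(chars))   # full pass #2, plus an O(n log n) sort
merged = chars | chars             # full pass #3: set.__or__ copies its operands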
def _collapse_string_to_ranges(
    s: Union[str, Iterable[str]], re_escape: bool = True
) -> str:
    ...
    ret = []
    # Haven't we been here before?
    s = "".join(sorted(set(s)))
    if len(s) > 3:
        # Yet another full iteration
        for _, chars in itertools.groupby(s, key=is_consecutive):
            ...  # Itertools dark magic to append to ret.
            # Mind you, list.append() is amortized O(1).
    else:
        ...
    return "".join(ret)  # A third .join()
I removed sections that are unreachable in this specific case (no as_keyword and such), but some of those create new strings too.
All in all, this is a lengthy and time-consuming process just to create a native re.Pattern instance (has anyone counted how many times it iterates over the original 1.1M-character string?). So I switched to Regex, which makes use of re directly:
_non_ascii = Regex(r'[\x80-\U0010FFFF]')  # With and without the r prefix are both fine.
No giant strings are created, and it does more or less the same thing a normal Char was doing.
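A quick sanity check that the Regex version behaves like the old one (a minimal sketch; é is just an arbitrary non-ASCII character):

from pyparsing import ParseException, Regex

_non_ascii = Regex(r'[\x80-\U0010FFFF]')

print(_non_ascii.parse_string("é"))  # ['é'] -- non-ASCII matches
try:
    _non_ascii.parse_string("a")
except ParseException:
    print("plain ASCII is rejected, as before")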