Testing with Bash specifically, I tried using wildcards (as if it was a case
statement):
[[ ${var@a} == *"A"* ]]
It surprisingly (to me) works, but it's uglier than a regex. Then I compared them time-wise:
$ time for ((i = 0 ; i < 1000000 ; i++ )); do [[ ${var@a} == *"A"* ]] && :; done
real 0m2.512s
user 0m2.500s
sys 0m0.003s
$ time for ((i = 0 ; i < 1000000 ; i++ )); do [[ ${var@a} =~ "A" ]] && :; done
real 0m3.578s
user 0m3.553s
sys 0m0.003s
Is there any explanation why (this simple) regular expression is so much slower here? Is it that Bash regex implementation is slow per se?
Is it that Bash regex implementation is slow per se?
Seems so, at least in comparison to globs (e.g. == *A*).
However, I don't think bash is to blame here too much, because ...
For executing¹ a parsed [[ =~ ]] command, bash just calls the C standard library functions regcomp and regexec from <regex.h> here.
That means the performance of =~ depends directly on your standard library, e.g. glibc in most cases. Maybe other implementations of the C standard library are faster (e.g. musl)? I don't know.
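Because the result depends on the libc in use, it can be worth rerunning the comparison on your own system. Here is a minimal sketch of a harness for that. The function names bench_glob and bench_regex and the N environment variable are made up for this example, and the default iteration count is arbitrary.

```bash
#!/usr/bin/env bash
# Iteration count; N is a hypothetical environment variable for this sketch.
n=${N:-100000}

# Each function runs one kind of test n times on the string passed as $1.
# The exit status is that of the last test executed.
bench_glob() {
  local i s=$1
  for ((i = 0; i < n; i++)); do [[ $s == *A* ]]; done
}

bench_regex() {
  local i s=$1
  for ((i = 0; i < n; i++)); do [[ $s =~ A ]]; done
}

# %R is elapsed wall-clock time, %U is user CPU time (both in seconds).
TIMEFORMAT='%R s elapsed, %U s user'
echo "glob:";  time bench_glob  xAx
echo "regex:"; time bench_regex xAx
```

Both loops have the same shape, so the difference between the two timings should mostly reflect the test itself. Absolute numbers will vary with the machine, the bash version and the libc.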
There is a bit of overhead for turning =~ a"."b into the POSIX extended regular expression a\.b when parsing the command. However, that isn't the problem here, as the following test confirms:
time for ((i=0; i<1000000; i++)); do [[ t ]]; done # 1.6s
time for ((i=0; i<1000000; i++)); do [[ a == *b* ]]; done # 1.9s
time for ((i=0; i<1000000; i++)); do [[ a =~ b ]]; done # 2.8s
time for ((i=0; i<1000000; i++)); do [[ (t || a =~ b) || t ]]; done # 1.9s
In the last command, bash has to parse a =~ b but does not execute it, because t is true (like any other non-empty string) and the or-operator || short-circuits. Since the resulting time is roughly the same for [[ t ]] and [[ (t || a =~ b) || t ]] (1.6s vs. 1.9s, compared to 2.8s when the regex is actually executed), the time required for parsing =~ is negligible.
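Both behaviours this argument relies on can be checked without timing anything. The sketch below first shows that quoted parts of the right-hand side of =~ are matched literally, then uses a := expansion as a probe to show that the right operand of || is never evaluated once the left one is true. The variable names are made up for illustration.

```bash
#!/usr/bin/env bash
# 1) Quoted parts of the right-hand side of =~ are matched literally.
[[ a.b =~ a"."b ]] && literal_dot=match || literal_dot=nomatch  # quoted "." is a literal dot
[[ axb =~ a"."b ]] && literal_x=match   || literal_x=nomatch    # so "x" cannot stand in for it
[[ axb =~ a.b ]]   && unquoted_x=match  || unquoted_x=nomatch   # unquoted . matches any character

# 2) The right operand of || is skipped once the left operand is true.
unset -v probe
[[ t || ${probe:=evaluated} ]]    # the := expansion would assign probe if it ran
skipped=${probe-never evaluated}  # probe is still unset here

[[ "" || ${probe:=evaluated} ]]   # the empty string is false, so the right side runs
ran=$probe

printf '%s %s %s\n' "$literal_dot" "$literal_x" "$unquoted_x"
printf 'after true left operand:  probe is %s\n' "$skipped"
printf 'after false left operand: probe is %s\n' "$ran"
```

The probe stays unset in the first || test, so a regex placed in that position would never be handed to the regex library. That is why the last timing above measures parsing only.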
¹ Just for context: Bash starts to evaluate both =~ and == in [[ here.
The implementation behind == is here, a rather convoluted implementation with more than 300 lines of code. But I assume the implementation behind regcomp and regexec always has to be longer than that, see for instance regexec in glibc (don't forget to look into the internal functions too, e.g. re_search_internal).