regex bash wildcard

Are wildcards in tests always faster?


Testing with Bash specifically, I tried using wildcards (as one would in a case statement):

[[ ${var@a} == *"A"* ]]

It surprisingly (to me) works, but it's uglier than a regex. Then I compared them time-wise:

$ time for ((i = 0 ; i < 1000000 ; i++ )); do [[ ${var@a} == *"A"* ]] && :; done

real    0m2.512s
user    0m2.500s
sys     0m0.003s

$ time for ((i = 0 ; i < 1000000 ; i++ )); do [[ ${var@a} =~ "A" ]] && :; done

real    0m3.578s
user    0m3.553s
sys     0m0.003s

Is there any explanation why (this simple) regular expression is so much slower here? Is it that Bash regex implementation is slow per se?


Solution

  • Is it that Bash regex implementation is slow per se?

    Seems so, at least in comparison to globs (e.g. == *A*).
    However, I don't think bash is to blame here too much, because ...

    Regex matching is not implemented directly in bash

    To execute¹ a parsed [[ =~ ]] command, bash just calls the functions regcomp and regexec from <regex.h>. These are POSIX functions provided by your C library.

    That means the performance of =~ depends directly on your standard library, e.g. glibc in most cases. Maybe other implementations of the C standard library are faster (e.g. musl)? I don't know.

    There is a bit of overhead for turning =~ a"."b into the POSIX extended regular expression a\.b when parsing the command. However, that isn't the problem here, as the test below confirms:

    time for ((i=0; i<1000000; i++)); do [[ t ]]; done                   # 1.6s
    time for ((i=0; i<1000000; i++)); do [[ a == *b* ]]; done            # 1.9s
    time for ((i=0; i<1000000; i++)); do [[ a =~ b ]]; done              # 2.8s
    time for ((i=0; i<1000000; i++)); do [[ (t || a =~ b) || t ]]; done  # 1.9s
    

    In the last command, bash has to parse a =~ b but does not execute it, because t is true (like any other non-empty string) and the or-operator || short-circuits. Since the last command takes roughly as long as [[ t ]] and clearly less than [[ a =~ b ]], the time required for parsing =~ is negligible.
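    The two behaviours this argument relies on can be checked directly. The following is only a minimal sketch: the helper names literal_dot and any_char are made up for illustration, and it assumes a bash recent enough (3.2 or later) that quoted parts of a =~ pattern are matched literally.

```shell
#!/usr/bin/env bash
# Sketch: two behaviours of [[ =~ ]] discussed above.

# 1) Quoted parts of the pattern are literal, unquoted parts are regex syntax.
literal_dot() { [[ $1 =~ a"."b ]]; }   # hypothetical helper: "." must be a real dot
any_char()    { [[ $1 =~ a.b ]]; }     # hypothetical helper: "." matches any character

literal_dot 'a.b' && echo 'a"."b matches a.b'
literal_dot 'axb' || echo 'a"."b does not match axb'
any_char    'axb' && echo 'a.b matches axb'

# 2) A short-circuited =~ is parsed but never executed, so BASH_REMATCH
#    keeps the value from the previous match.
[[ abc =~ b ]]
echo "after executed match:      ${BASH_REMATCH[0]}"
[[ t || xyz =~ y ]]
echo "after short-circuited one: ${BASH_REMATCH[0]}"
```

    Both of the last two lines print b: the second [[ ]] never reaches the regex, so BASH_REMATCH is left untouched.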

    ¹ Just for context: Bash begins evaluating both =~ and == within its [[ handling code. The implementation behind == is rather convoluted, with more than 300 lines of code. But I assume the implementation behind regcomp and regexec has to be longer than that; see for instance regexec in glibc (don't forget to look into the internal functions too, e.g. re_search_internal).