regexperformancesedsubstitution

What is causing the inefficiency of GNU sed using `.*` substitutions?


SLES features a supportconfig command that was found to be very slow when creating the "Updates". First examination showed that sed is using "100% CPU" for minutes, so I suspected something rather inefficient as sed is usually quite efficient.

Here is my isolated test case (the file was copied from a temporary file that supportconfig would create):

# time sed -i -e '   s/\(.*[P|p]ass"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(.*[P|p]assword"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;     s/\(.*[P|p]ass[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;        s/\(.*[P|p]assword[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;    s/\(.*PASS=\).*/\1*REMOVED BY SUPPORTCONFIG*/g; s/\(.*_PASSWORD[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;      s!\(<user_password>\).*\(</user_password>\)!\1*REMOVED BY SUPPORTCONFIG*\2!g;   s/\(^ProxyUser[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(^credentials[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;      s/\(secret[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;    s/\({'\''[s]*password'\''}[[:space:]]*=[[:space:]]*'\''\).*\('\'';\)/\1*REMOVED BY SUPPORTCONFIG*\2/g;  s/\(.*password[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;        s/\(.*password_in[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;     s/\(^echo -n\).*\(> \/sys\/kernel\/config\/target\/.*auth\/password.*\)/\1 *REMOVED BY SUPPORTCONFIG* \2/g' /tmp/uuu

real    6m53.283s
user    6m45.909s
sys     0m0.129s
# wc /tmp/uuu
  12234  123937 2711538 /tmp/uuu

So it took almost 7 minutes to process a 2.7MB text file. The only special thing may be that the text lines are rather long.

I suspect that the .* is causing massive backtracking when matching, and maybe the programmer was just a bit lazy providing better regular expressions. While sed was running, I also did an strace, but that basically just showed brk system calls (memory allocation).

So what might cause the terrible performance, and is there a way to improve it?

I can't provide the original input file as it seems to contain URLs for non-free downloads, unfortunately. But I could try proposals for a different sed invocation on the file I saved.

sedversion being used is sed-4.2.2-7.3.1.x86_64.

Transformation for readability

Fixing the [P|p] to [Pp], a more readable version of the sed command will look like this:

sed -i -e '
s/\(.*[Pp]ass"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*[Pp]assword"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*[Pp]ass[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*[Pp]assword[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*PASS=\).*/\1*REMOVED BY SUPPORTCONFIG*/g;
s/\(.*_PASSWORD[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s!\(<user_password>\).*\(</user_password>\)!\1*REMOVED BY SUPPORTCONFIG*\2!g;
s/\(^ProxyUser[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(^credentials[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(secret[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\({'\''[s]*password'\''}[[:space:]]*=[[:space:]]*'\''\).*\('\'';\)/\1*REMOVED BY SUPPORTCONFIG*\2/g;
s/\(.*password[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*password_in[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(^echo -n\).*\(> \/sys\/kernel\/config\/target\/.*auth\/password.*\)/\1 *REMOVED BY SUPPORTCONFIG* \2/g'

So obviously the purpose is to remove sensitive information like passwords from the log file.

Note: it seems the code block does parse * as markup in some cases...

Trying a solution

Doing the modifications suggested in https://stackoverflow.com/a/77820459/6607497, I got these results:

### original version of regex
# /tmp/uuu.sh

real    6m39.697s
user    6m39.092s
sys     0m0.032s
### modified version of regex
# /tmp/uuu.sh

real    0m0.237s
user    0m0.226s
sys     0m0.009s

When reducing the original input file to a smaller test case (about 9kB), the effect is still there, but less dramatic:

### original regexes
# /tmp/uuu.sh /tmp/eeee

real    0m0.119s
user    0m0.115s
sys     0m0.004s
### improved regexes
# /tmp/uuu.sh /tmp/eeee

real    0m0.007s
user    0m0.007s
sys     0m0.000s

Here is the obfuscated and trimmed test input file as BASE64 (due to very long lines); pipe input into base64 -d | gzip -d to decode:

# gzip < /tmp/eeee | base64
H4sIAH/cp2UAA9VZbVPbuBb+vP4VZ9yZ7m6JnThQyk0J99JAu2x5yRDKbKfTySi2kmiQLY8kA7nL
7m+/R/ILSYAUmnRv6zKNrJfjo+ccPedIetZuf4KOiGOSRPC5/ajnmfMM6pmS9QFL6v+dpCmV4HmJ
SDyWaCpJqNkltTXeKB154ZiGFwpSKaIs1Mo5pUNJ1ZglI1BUXrKQws+9D739/iFLsuv+vpGRSqZo
v4fNVPaHQvZ7u93+bppyFhLNRKL6QbPf677sX29t9jc3fvadQ0EiI1LSVCimhZxARDTxfR8/mDex
RGnCOY0gJeEFGVFlWnsAN3B6O6p8buDAaJIQDsckpkVdVXzqcwPnVCpUHUu7MhznX1DwhijqeJ63
5t3z3Fd5b8fHPWt3S1XBYWuo0H96E6VpPKu44tSL0Xj4w+mIhBOsO8wLR7b6aTgETa+Rl3LjGVSF
bXNsbe9w3/ttF3uhfb2uELzUYUxmBRmnAes0cOs08BsbjWH3kjBOBowzPYH9a00TC33QBBRqdfBf
Gi3u0cHigDoUCqDj5UrYyp5xxC/rkDsuoOOC6T/tuAt1+EjVDA69GRTKynlAF+pQfO5BWzyIg/Ps
H6MGz7uOucj0LUds/3soxIBIZ1tpSUm842zHVClcs6AnKW27LBkKd+ceKnlOcCW/XpJPciH+dr34
6AOff5hzvjRyISVNDy4Q8ThT+vYVEqShtpsvChcuc2Zpu8acLqrDKbJK2224gJqFY1si0hTy+Zkh
SSRk281957ADz7l+PdY6Va16/erqyleZon4o4vrzkX7tgsrQD+SkHPD0RedalPLxS1imdYcd3NJr
OEtwzuY7IwSLygoG1H4spM4xw/HlcBeGnFwaFLonJ4cuMDWw/YeEK+re2qas2dk2oA05G6IxWUz7
GhF/td5YX29uNRBhTa+xptlobnhBw1sPzhpBq9HAv7VG4NZ3tkvF8q/u0BKh7fpcy3ZEVShZaua/
82TA0X+mx2/XC3zuuE/JZ9+DA32ZMVfoQHPU/nQP8m4l/H+dqH6frxwcf/jj4UhwH8RP8ZjvyFtW
7hi9pXiltzpaqRyh8UibzyMkhkO0EhBATFNJx4YaLqlTRn6VMW06VREXBhnjGpCuCSiMTJjUhSKi
YFT3q2FnYwopJxrRjIFEEQZfRc1YHIJFSCiNFAyliKshGoeo2ExYaXxhiRdyRhMNETXxWoEWts+V
kDzK466qBscCB6XiCqeecRgj53mYUZivkySkdm6ZRm2rASZViQnGWGnydGVtr/z7M6RqUIEVSotx
/igaQ29sVNToBsrK1DQcJ4KL0aQaFVKp2bAiKRJKgQjoKYRqZmjVnxJM+svlxBTQShFEhCj1z9P2
Ckm52DfMrRnMT9TXsune+d7t0tEym105tuKHptI7m6oZcy5lzJm92RctNK/FYnPlsr1S9mOZ7kF7
LSK6nVyzxemRCSSpCRrOLUXNbk7HlKcKJiKDmI0kMXwBZNqVDV9NUUQhaMrqQcOSQBDYH4FrXIKy
cFr6un/xBM0aDEq6QNAvmc34yzwfrsYMCYFIChFToUhQrwz3AchAi9gqd8AaGtsMVi3kuCRCxuM1
oxESlJeManDw5gh+J5dk01IQUnqSxQNUWgwLMZwNJJEM1fjFODG9JnGKUIkU44TiXsP/l7/1q+/8
ZNCt5/D+NI/0bhgazi/4exZz5DeWhDyLcEYsMeDLR0wLZzWoLOtbU8aFDYmJZREzTG2Y2fhQOAmx
RY9JstB4hWimFeVDH7ol/qiW0RtDkiq/YoyRqQx9trSbytIUfR9nYWAiOo9IeiwphQklNrbkXXIy
KIxhUS+NAJuFsCvGOdK+RaRHU3QfY5NmI3hlNcGogTFyBvSHyaQ+tzGsl/vkfNt+TmUVmx6xdzfb
9tPuEe79NRyUq7YFXA2qY4BkyEaZzAW+ZQjWIqn2GIDq0B4B2P8wt0uGzicTmz87ccbNOUB+HNYu
FgdVren6Xy6oTCj/daazn1cqHJSjVct/vKAmsyQxmcAqFA4ljdDLGNKYH9WPO52pCgcTVWm5EF6c
7h+dnO/vwZuP6IHd7snpWefk+O3BuxdOigEdM5poYaevt5VMY/DOrdvjRBOKftDw1/31TW/dD7b8
DT+PEE7Pf+njc+Yb97PnMrjw6zl1RcVv3aTeYSHlgtKUcITbNwQt8VMz2vU00RmSzl6xEE3+9lZk
mOWs0u1kNpg0/cAzv5iJeUZBr9BwyWMpZc6lpmADz1N2Ts6nP11mbTxkVLqt2yyr5hbOh5U2z6q5
JgrjWxGGa24uAmtOi0BFI6w1TmIEPTW1qpmzOG9zowaBh3mGCC8o5udXTI/hQ8JZzAwdnTOpkajg
CLNKk3vXoCuZkOYgoDdFojVkaVw9uUHe0wmqhcHUZPZG/5knn4fUqk80NhpaMolZIwCbM+EffDjr
YC96nTJM/MtuzZdeYwuzq/lu01zerxDa7ZwdnO9jszkNw3fM67n7V20O/LtpyawVFtnAONOUHf76
vCQjTHmLY8/JE1wumaQtsGmOk0negnIrq8Kw2sp+AyrqfSsqyqmgr8UFTR7DWauc03Lb9tXz8Tez
19/ft65fZ4f1H8IOXze3jR9hbt86PVnmBqjYAZI8BuAmwCQMnsIuN7DLGVH330s96lniLraSsJ+Q
gbl8uYF33XfQMbdR9jLY3iph6QyDFDhL3bneuXpdXsLaolonKG4El+DUlUgo7jKxZNQqEb+tRYEA
TrO48J06ki/vJJ9gx+UlFLeec9qW5RuYEKWbK81yTZpQ5ikrWGo0uUQ1R5Km4DGzpbqeVO+X0Dna
Ozw43neOT/rd05M/Pra5CAkf42a2FjRf+Q38F9Re/P7h+P0Lx+QyfSugbYqY1eQNrfWguVV3hlWr
0zvpvO8VAh2FiaoqW0YiHaOjFG82O1ogMhFF4wKt/gfLVc0SqSIAAA==

Another Experiment

Inspired by https://stackoverflow.com/a/77833314/6607497, I tried to find out the actual timing differences, so I created a file with one million random printable ASCII characters (assuming it will be large enough to show a difference).

Not knowing what's actually in the file, I tried the "foo variants" using grep (assuming grep would show similar effects):

> time grep -q 'foo' /dev/shm/random.txt

real    0m0.005s
user    0m0.004s
sys     0m0.001s
> time grep -q '.*foo' /dev/shm/random.txt

real    0m0.005s
user    0m0.000s
sys     0m0.006s
> time grep -q 'foo.*' /dev/shm/random.txt

real    0m0.005s
user    0m0.004s
sys     0m0.000s
> time grep -q '.*foo.*' /dev/shm/random.txt

real    0m0.005s
user    0m0.004s
sys     0m0.001s

Surprisingly there was no difference (and "foo" is not present in the file). However there are matches for fo. (position:match) like this:

763:foH
3440:foY
13789:fo;
...
958405:foE
965938:foF
967142:foD

The complete list is:

> grep -o fo. /dev/shm/random.txt | tr '\n' ' '
foH foY fo; fof fom foa foj foV fo9 foy fo2 foP fo9 fo4 foC for fox foG foh foa foj fo} fod foq fo} foR fow fok foy foh fo> foy foK fo` foW fo2 foJ foL fo+ for fo~ foP fo` foq fo[ foi fos fo~ fo: fo. foy foD fo@ fo= fo+ foT foO fo] fok fo^ foa foh fod foA fo8 foJ foX fo` foN fo3 foS fo* fog foK fo6 fof fo4 fog foE fo* foO fo] foB fo< fo# fo" fow fo8 fo> foA fo+ foO fog foj foD fo: for fo1 fo. fo' foi foO foW foE foF foD >

So I tried to use sed actually:

> time sed -n -e 's/foo/XXX/p' /dev/shm/random.txt

real    0m0.005s
user    0m0.001s
sys     0m0.004s
> time sed -n -e 's/.*foo/XXX/p' /dev/shm/random.txt

real    1m20.032s
user    1m19.818s
sys     0m0.016s
> time sed -n -e 's/foo.*/XXX/p' /dev/shm/random.txt

real    0m0.006s
user    0m0.000s
sys     0m0.006s
> time sed -n -e 's/.*foo.*/XXX/p' /dev/shm/random.txt

real    1m19.490s
user    1m19.425s
sys     0m0.017s

Hunting for the constant string showed the same performance as grep had, but prefixing .* showed a performance of almost 80000 times worse! In contrast suffixing .* had almost no effect, and using both, prefix and suffix .* is basically the same as just using the prefix variant.

Versions being used: grep-2.16-4.6.1.x86_64 sed-4.2.2-7.3.1.x86_64

Not confusing enough?

Then I had the idea to quantify the performance hit of using prefix .*, so I decided to truncate the random file to just using the first 1000 characters:

> head -c1000 /dev/shm/random.txt >/dev/shm/random-1000.txt
> time sed -n -e 's/foo/XXX/p' /dev/shm/random-1000.txt

real    0m0.001s
user    0m0.001s
sys     0m0.000s
> time sed -n -e 's/.*foo/XXX/p' /dev/shm/random-1000.txt

real    0m0.001s
user    0m0.001s
sys     0m0.000s
> time sed -n -e 's/foo.*/XXX/p' /dev/shm/random-1000.txt

real    0m0.001s
user    0m0.001s
sys     0m0.000s
> time sed -n -e 's/.*foo.*/XXX/p' /dev/shm/random-1000.txt

real    0m0.003s
user    0m0.003s
sys     0m0.000s

To my big surprise, there was no noticeable difference now!?

I was thinking:

So I repeated the tests after having added an end-of-line character: Still it was about the same fast performance!

Then I re-did the experiments with sizes of 10000 and 100000 characters: For 10000 characters the .*prefix was only 9 times slower than the 1000 character variant, but the 100000 characters was about 90 times slower than as the 10000 character variant.

So my conclusion is that sed has some fixed internal "fast buffer". Is that correct?


Solution

  • As suspected, those .* matches were very bad for performance:

    After removing those, the performance did improve enormously (like factor 1500):

    real    0m0.230s
    user    0m0.225s
    sys     0m0.005s
    

    For reference this is the canonical command used (I also dropped -i to preserve the input file for repeatable tests):

    time sed -e '
    s/\([Pp]ass"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
    s/\([Pp]assword"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
    s/\([Pp]ass[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
    s/\([Pp]assword[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
    s/\(PASS=\).*/\1*REMOVED BY SUPPORTCONFIG*/g;
    s/\(_PASSWORD[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
    s!\(<user_password>\).*\(</user_password>\)!\1*REMOVED BY SUPPORTCONFIG*\2!g;
    s/\(^ProxyUser[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
    s/\(^credentials[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
    s/\(secret[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
    s/\({'\''[s]*password'\''}[[:space:]]*=[[:space:]]*'\''\).*\('\'';\)/\1*REMOVED BY SUPPORTCONFIG*\2/g;
    s/\(password[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
    s/\(password_in[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
    s/\(^echo -n\).*\(> \/sys\/kernel\/config\/target\/.*auth\/password.*\)/\1 *REMOVED BY SUPPORTCONFIG* \2/g
    ' "${1:-/tmp/uuu}" >"${2:-/dev/null}"