
Bash regex vs. diacritics


I use a regex check in Bash to make sure that a string contains only sane characters (lowercase a-z in this case), and I ran into this strange behavior.

It looks the same in grep and sed. Python 3.9 behaves as I would expect.

Am I doing something wrong, or is it a bug? If it is a bug, where should I report it?


Lowercase š is wrongly detected as a character between a-z:

[[ 'š' =~ ^[a-z]$ ]] && echo sane || echo nope
sane

[[ 'š' =~ [a-z] ]] && echo sane || echo nope
sane

grep '^[a-z]$' <<<'š' && echo sane || echo nope
š
sane

sed 's/^[a-z]$/a/' <<<'š'
a

Lowercase ž is correctly detected as not a character between a-z (EDIT: because ž sorts right after z, that is, outside of a-z):

[[ 'ž' =~ ^[a-z]$ ]] && echo sane || echo nope
nope

[[ 'ž' =~ [a-z] ]] && echo sane || echo nope
nope

grep '^[a-z]$' <<<'ž' && echo sane || echo nope
nope

sed 's/^[a-z]$/a/' <<<'ž'
ž

Capital Š is correctly detected as not a character between a-z:

[[ 'Š' =~ ^[a-z]$ ]] && echo sane || echo nope
nope

[[ 'Š' =~ [a-z] ]] && echo sane || echo nope
nope

grep '^[a-z]$' <<<'Š' && echo sane || echo nope
nope

sed 's/^[a-z]$/a/' <<<'Š'
Š

Capital Š is wrongly detected as a character between A-Z:

[[ 'Š' =~ ^[A-Z]$ ]] && echo sane || echo nope
sane

[[ 'Š' =~ [A-Z] ]] && echo sane || echo nope
sane

grep '^[A-Z]$' <<<'Š' && echo sane || echo nope
Š
sane

sed 's/^[A-Z]$/A/' <<<'Š'
A

My bash version:

GNU bash, version 5.1.8(1)-release (x86_64-redhat-linux-gnu)

My grep version:

grep (GNU grep) 3.6

My sed version:

sed (GNU sed) 4.8

My locale:

locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Python:

python3 -c 'import re ;print("sane" if re.match(r"^[a-z]$", "š") else "nope")'
nope

python3 -c 'import re ;print("sane" if re.match(r"^[a-z]$", "s") else "nope")'
sane

EDIT:

As @oguz-ismail pointed out, ž was just a badly chosen outlier (literally), as it sorts after z. The behavior is consistent for characters that sort between a and z in alphabetical order, such as š and č. To exclude them all, I had to set LC_ALL=C.

[[ 'č' =~ ^[a-z]$ ]] && echo sane || echo nope
sane

LC_CTYPE=C
# or stronger: LC_ALL=C

[[ 'č' =~ ^[a-z]$ ]] && echo sane || echo nope
nope
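To keep the locale override from leaking into the rest of a script, it can be scoped to a function. This is only a minimal sketch: `is_ascii_lower` is a hypothetical helper name, and it assumes a reasonably recent bash that re-reads the locale when `LC_ALL` is assigned as a local variable. It also checks the whole string (`+`) rather than a single character.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds only if $1 is non-empty and consists
# solely of the ASCII letters a-z.
is_ascii_lower() {
    # "local" limits the C-locale override to this function call.
    local LC_ALL=C
    [[ $1 =~ ^[a-z]+$ ]]
}

for s in s š č ž abc; do
    if is_ascii_lower "$s"; then
        echo "$s: sane"
    else
        echo "$s: nope"
    fi
done
```

In the C locale the multibyte UTF-8 letters are seen as bytes above 0x7F, so only `s` and `abc` should be reported as sane.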

My last question is whether the [a-z] range is expected to match letters with diacritics. (I definitely did not expect it.)


Solution

  • Am I doing something wrong or is it a bug?

    You are doing something whose behavior is locale-sensitive and may not be specified by POSIX. The observed behavior is probably not a bug.

    Bash's =~ operator uses POSIX extended regular expressions, and POSIX leaves the behavior of range expressions inside bracket expressions unspecified except in the POSIX locale. In the POSIX locale (and maybe elsewhere), the meaning of a range expression depends on the collation order in effect. It is my understanding that in locales for languages and regions where letters with diacritical marks are in common use, such letters are often collated together with the corresponding base letter. The behavior you describe is consistent with such a collation order.

    If you want to match against the characters mapped by ASCII (and Unicode) to code points 0x61 - 0x7A, and only those, regardless of locale, then the most reliable way to spell that is to list all the matching characters individually:

    [[ 'š' =~ ^[abcdefghijklmnopqrstuvwxyz]$ ]] && echo sane || echo nope
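The explicit list can be wrapped in a small function so it is written only once. The following is a sketch under assumptions rather than a definitive implementation: `only_ascii_lower` and `ascii_lower_re` are hypothetical names, and the regex uses `+` so that it validates a whole non-empty string instead of a single character.

```shell
#!/usr/bin/env bash

# All 26 ASCII lowercase letters listed explicitly: there is no range
# expression, so the collation order of the locale is not consulted.
ascii_lower_re='^[abcdefghijklmnopqrstuvwxyz]+$'

# Hypothetical helper: succeeds only if $1 is non-empty and every
# character is one of the listed letters.
only_ascii_lower() {
    # The regex is kept in a variable and expanded unquoted so that
    # bash treats it as a pattern rather than a literal string.
    [[ $1 =~ $ascii_lower_re ]]
}

for s in s š Š hello héllo; do
    if only_ascii_lower "$s"; then
        echo "$s: sane"
    else
        echo "$s: nope"
    fi
done
```

Because no range is involved, the result should not change when the script runs under a different locale.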