I have a 6-byte text file (input.txt). It contains one line of three characters:
αβγ
The file is in UTF-8. It can be generated by the following command:
echo 'CEB1CEB2CEB3' | xxd -r -p > input.txt
I used the following command:
LC_ALL=en_US.utf8 gawk -F '\\|' 'BEGIN {IGNORECASE = 0};
{ t = gensub(/\262{1,}/, "", "g", $0);
print(t) > "output1.txt" }' input.txt
The output is (in hexadecimal):
CEB1CECEB30A
This is invalid UTF-8: the second character is corrupted because the B2
byte was deleted.
Then I used the following command:
LC_ALL=en_US.utf8 gawk -F '\\|' 'BEGIN {IGNORECASE = 1};
{ t = gensub(/\262{1,}/, "", "g", $0);
print(t) > "output2.txt" }' input.txt
The output is (in hexadecimal):
CEB1CEB2CEB30A
This is valid UTF-8: the text can be read with no errors.
What explains the fact that these two commands generate two different outputs? What is the effect of IGNORECASE in this situation? Why is the B2 byte not deleted when the IGNORECASE flag is active?
Assumptions/understandings: the question is about the difference in behavior between IGNORECASE=0 and IGNORECASE=1.
For discussion and testing purposes I'll use the following file:
$ cat input.txt
αβγ # lowercase
ΑΒΓ # uppercase
NOTE: the file itself does not contain the comments
Viewing individual bytes:
$ cat input.txt | od -c
0000000 316 261 316 262 316 263  \n 316 221 316 222 316 223  \n
        ^^^^^^^ ^^^^^^^ ^^^^^^^     ^^^^^^^ ^^^^^^^ ^^^^^^^
           α       β       γ           Α       Β       Γ
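For anyone who wants to follow along, the test file can be recreated from the octal byte values shown above. This is a minimal sketch that assumes a POSIX-style printf (which accepts \ddd octal escapes in its format string) and writes input.txt into the current directory:

```shell
# Recreate the two-line test file from octal escapes (no comments in the file).
printf '\316\261\316\262\316\263\n\316\221\316\222\316\223\n' > input.txt

# Byte count: 2 lines x (3 characters x 2 bytes + 1 newline).
wc -c < input.txt
```

The byte count should come out as 14, matching the 14 entries in the od -c dump.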
NOTE: OP's question switches back and forth between octal and hexadecimal; I'm going to stick with octal, with the understanding that the same results are obtained when using hexadecimal
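To map between the two notations, printf can do the conversion, since it reads its numeric arguments as C-style constants (0x prefix for hexadecimal, leading 0 for octal). A small sketch using the two bytes that make up β:

```shell
# Hex (as in OP's dumps) -> octal (as used in this answer and in the regex)
printf '%o %o\n' 0xCE 0xB2    # prints: 316 262

# Octal -> hex
printf '%X %X\n' 0316 0262    # prints: CE B2
```

So the \262 in OP's regex is the same byte as the B2 in OP's hex dumps.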
References:
Case is significant by default because IGNORECASE (like most variables) is initialized to zero
... snip ...
In multibyte locales, the equivalences between upper- and lowercase characters are tested based on the wide-character values of the locale’s character set.
This is telling us a) the default behavior is IGNORECASE=0 and b) case-insensitive testing of characters requires taking into consideration all bytes that make up the character.
IGNORECASE
If IGNORECASE is nonzero or non-null, then all string comparisons and all regular expression matching are case-independent. This applies to regexp matching with ... gensub() ...
When combined with the previous quote this is telling us that when IGNORECASE=1 all regex comparisons (e.g., the 1st argument to gensub()) are performed at the (multibyte) character level.
In OP's 1st example we see the default case-sensitive behavior of regex matching on a single byte and the removal of said byte:
$ export LC_ALL=en_US.utf8 # run once; applies to all follow-on commands
$ gawk 'BEGIN{IGNORECASE=0} {print gensub(/\262/,"","g")}' input.txt | od -c
0000000 316 261 316 316 263  \n 316 221 316 222 316 223  \n
        ^^^^^^^ ^^^ ^^^^^^^     ^^^^^^^ ^^^^^^^ ^^^^^^^
           α     ?     γ           Α       Β       Γ
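The ? above marks a lead byte (\316) that has lost its continuation byte. One way to confirm that such a byte sequence is no longer valid UTF-8 is to round-trip it through iconv, which exits non-zero on an invalid input sequence. This is only a sketch: is_valid_utf8 is a helper name made up for illustration, and it assumes an iconv command (glibc or libiconv) is installed:

```shell
# Hypothetical helper: succeeds only if stdin is well-formed UTF-8.
is_valid_utf8() { iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; }

# Original first line: α β γ
printf '\316\261\316\262\316\263\n' | is_valid_utf8 && echo "original: valid"

# Same line after the \262 byte has been deleted
printf '\316\261\316\316\263\n' | is_valid_utf8 || echo "after byte deletion: invalid"
```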
In OP's 2nd example the inclusion of IGNORECASE=1 causes gawk to require all regex comparisons (e.g., the 1st argument to gensub()) to be processed as characters. Since the lone byte \262 is not a valid character in UTF-8 (on its own it is only a continuation byte), gensub() finds no match, so nothing is changed:
$ gawk 'BEGIN{IGNORECASE=1} {print gensub(/\262/,"","g")}' input.txt | od -c
0000000 316 261 316 262 316 263  \n 316 221 316 222 316 223  \n
        ^^^^^^^ ^^^^^^^ ^^^^^^^     ^^^^^^^ ^^^^^^^ ^^^^^^^
           α       β       γ           Α       Β       Γ
In order to generate the same output (as IGNORECASE=0) we can override IGNORECASE=1 with the -b/--characters-as-bytes flag, thus (re)enabling regex processing at the byte level:
$ gawk -b 'BEGIN{IGNORECASE=1} {print gensub(/\262/,"","g")}' input.txt | od -c
0000000 316 261 316 316 263  \n 316 221 316 222 316 223  \n
        ^^^^^^^ ^^^ ^^^^^^^     ^^^^^^^ ^^^^^^^ ^^^^^^^
           α     ?     γ           Α       Β       Γ
Another (correct?) approach would be to regex match on the complete multibyte character (\316\262) and replace said match with the 'leftover' byte (\316):
$ gawk 'BEGIN{IGNORECASE=1} {print gensub(/\316\262/,"\316","g")}' input.txt | od -c
0000000 316 261 316 316 263  \n 316 221 316 316 223  \n
        ^^^^^^^ ^^^ ^^^^^^^     ^^^^^^^ ^^^ ^^^^^^^
           α     ?     γ           Α     ?     Γ
Notice in this case that the regex character \316\262 matches on both the lowercase (\316\262) and uppercase (\316\222) characters; both characters are replaced with the standalone byte \316!
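The reason the two byte sequences are treated as equivalent becomes clearer once they are decoded to code points: \316\262 is U+03B2 (Greek small letter beta) and \316\222 is U+0392 (Greek capital letter beta), i.e. a lowercase/uppercase pair at the wide-character level. A small sketch of the 2-byte UTF-8 decoding (110xxxxx 10yyyyyy) using shell arithmetic; decode2 is a helper name made up for illustration, and its arguments are octal constants:

```shell
# Hypothetical helper: code point of a 2-byte UTF-8 sequence.
# Keeps the low 5 bits of the lead byte and the low 6 bits of the continuation byte.
decode2() { printf 'U+%04X\n' $(( (($1 & 037) << 6) | ($2 & 077) )); }

decode2 0316 0262   # lowercase beta, prints: U+03B2
decode2 0316 0222   # uppercase beta, prints: U+0392
```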
Net results:
IGNORECASE=0 allows regex processing at the byte level
IGNORECASE=1 forces regex processing at the (multibyte) character level
-b overrides/negates IGNORECASE=1 and returns us to regex processing at the byte level
IGNORECASE=0 + gensub(/regex/) == regex processing at the byte level vs IGNORECASE=1 + gensub(/regex/) == regex processing at the character level