awk

GNU's AWK - Splitting Multibyte Character Strings w/Patsplit vs Split


I'd like to know whether the following behavior is reliable:

echo 😄 | gawk '{split($0,array,"."); print array[1] length(array);}'

Output is: 😄1

vs

echo 😄 | gawk '{patsplit($0,array,"."); print array[1] length(array);}'

Output is: �4

patsplit appears to work on bytes while split works on characters, but I haven't found this documented or discussed anywhere. Can I count on this behavior, and if so, in which environments?


Solution

  • This is because split and patsplit are fundamentally different functions.

    split divides a string into fields by a field separator, i.e. what's between the fields, while patsplit divides a string into fields by matching fields themselves with a field pattern.

    Per the documentation, gawk's string functions, including split and patsplit, operate on characters as defined by the current locale, not on raw bytes.

    Also, a single-character string other than a space, such as ".", is treated literally when used as a field separator rather than as a regex pattern (see the documentation on FS).

    Since the input string 😄 contains no literal ., split with "." as the field separator finds nothing to split on and sees only 1 field. array[1] then holds the whole string, which is why the emoji prints intact.

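    With patsplit, "." is a field pattern, i.e. a regex that matches any single character. 😄 is encoded as 4 bytes in UTF-8, and presumably you have set your locale to a byte-based one such as C, so each byte counts as one character and each . matches one byte of 😄, producing an array of size 4. In a UTF-8 locale the same call would match the emoji as a single character and produce an array of size 1.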