I am trying to implement the minbpe library in Zig, using a wrapper over the PCRE library.
The pattern in Python is r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
When I use the pattern with UTF-8 encoded text like abcdeparallel १२४, I get the following output:
>>> import regex as re
>>> p = re.compile(r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")
>>> p
regex.Regex("'(?:[sdmt]|ll|ve|re)| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)|\\s+", flags=regex.V0)
>>> p.findall("abcdeparallel १२४")
['abcdeparallel', ' १२४']
It looks like this is more or less the same in PCRE-flavored regex as well, with me just having to add a /g flag at the end, which I assumed would take care of UTF-8 matching.
However, when I try to use the pattern with PCRE via the pcre2test tool on macOS, I get very different output:
$ pcre2test -8
PCRE2 version 10.42 2022-12-11
re> /'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
0: abcdeparallel
0: \xe0
0: \xa5\xa7
0: \xe0
0: \xa5\xa8
0: \xe0
0: \xa5
0: \xaa
Somehow it looks like the code points for the Hindi numerals (1, 2, 4) are interpreted differently, and the output is matched as a totally different set of characters:
>>> b"\xe0\xa5\xa7\xe0\xa5\xa8".decode("utf-8")
'१२'
Is there a flag or something that I am missing that must be passed to get the same behaviour as the regex package/module from Python? When UTF-8 code points are encoded into bytes, wouldn't the library know how to put them back together into the same code points?
The Hindi code points are actually matched, but rendered on screen as hex escapes of their UTF-8 bytes:
>>> "१२४".encode("utf-8")
b'\xe0\xa5\xa7\xe0\xa5\xa8\xe0\xa5\xaa'
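As a rough illustration of why the output in the question is split into so many pieces: without UTF mode, PCRE2 treats each byte of the subject as its own character, so \p{L} and \p{N} are tested against the code points U+00E0, U+00A5 and so on, rather than against the Devanagari digits. The sketch below mimics that in Python using only the standard library. The helper names byte_class and split_like_byte_mode are hypothetical, and the grouping is a simplification (runs of letter / number / other, with no special handling of whitespace), not a reimplementation of the pattern.

```python
import unicodedata
from itertools import groupby


def byte_class(b: int) -> str:
    """Classify one byte as if it were the code point U+00XX (hypothetical helper)."""
    category = unicodedata.category(chr(b))
    if category.startswith("L"):
        return "letter"
    if category.startswith("N"):
        return "number"
    return "other"


def split_like_byte_mode(text: str) -> list:
    """Group the UTF-8 bytes of `text` into runs of the same class.

    Assumption: this only approximates byte-wise matching; it ignores
    the optional leading space and the whitespace alternatives.
    """
    raw = text.encode("utf-8")
    return [bytes(run) for _, run in groupby(raw, key=byte_class)]


chunks = split_like_byte_mode("१२४")
for chunk in chunks:
    print(chunk)
```

Under these assumptions the bytes fall into the same seven pieces that pcre2test printed for the numeral part of the subject: 0xe0 and 0xaa count as letters, while 0xa5, 0xa7 and 0xa8 are symbols or punctuation.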
According to the pcre2test documentation:
When pcre2test is outputting text in the compiled version of a pattern, bytes other than 32-126 are always treated as non-printing characters and are therefore shown as hex escapes.
When pcre2test is outputting text that is a matched part of a subject string, it behaves in the same way, unless a different locale has been set for the pattern (using the locale modifier). In this case, the isprint() function is used to distinguish printing and non-printing characters.
The documentation doesn't mention which locales can be used. The example (fr_FR) suggests a two-letter language code followed by a two-letter country code, but it's unclear to me whether Hindi is supported.
With the (*UTF) flag you do get two matches, and the Hindi numerals are then rendered as Unicode code point escapes:
re> /(*UTF)'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+/g
data> abcdeparallel १२४
0: abcdeparallel
0: \x{967}\x{968}\x{96a}
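To check that these escapes really are the original digits, the \x{...} notation can be converted back to text. A minimal sketch in Python, where unescape_pcre2test is a hypothetical helper that handles only the \x{hex} form shown above and leaves everything else untouched:

```python
import re


def unescape_pcre2test(s: str) -> str:
    """Replace each \\x{hex} escape with the character at that code point."""
    return re.sub(
        r"\\x\{([0-9a-fA-F]+)\}",
        lambda m: chr(int(m.group(1), 16)),
        s,
    )


decoded = unescape_pcre2test(r"\x{967}\x{968}\x{96a}")
print(decoded)  # १२४
```

U+0967, U+0968 and U+096A are the Devanagari digits one, two and four, so the second match is the same text that Python's regex module returned.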