I'm really stumped trying to match Chinese characters using a Perl one-liner in zsh. I cannot get \p{script=Han} to match Chinese characters, yet \P{script=Han} does match them.
Task: I need to change this:
一
<lb/> 二
to this:
<tag ref="一二">一
<lb/> 二</tag>
There could be a variable number of tags, newlines, whitespace characters, tabs, letters, digits, etc. between the two Chinese characters. I believe the most efficient and robust way to do this would be to look for something that is not a Chinese character.
My attempted solution:
perl -0777 -pi -e 's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$1$2$3<\/tag>/g'
This has the desired effect when applied to the example above.
Problem: The issue I am having is that \P{script=Han} (or \p{^script=Han}) matches Chinese characters as well.
When I try to match \p{script=Han}, the regex matches nothing despite it being a file full of Chinese characters. When trying to match \P{script=Han}, the regex matches every character in the file.
I don't know why.
This is a problem because, for input like the following, the output is not as desired:
一
<lb/> 三二
becomes
<tag ref="一二">一
<lb/> 三二</tag>
I don't want this to be matched at all. I only want to match instances where 一 and 二 are separated solely by characters that are not Chinese characters.
Can anyone tell me what I'm doing wrong? Or suggest a workaround? Thanks!
When I try to match \p{script=Han}, the regex matches nothing despite it being a file full of Chinese characters.
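The problem is that both your script and your input file are UTF-8 encoded, but you do not say so to perl. If you do not tell perl, it treats both as sequences of single bytes rather than as UTF-8, so each Chinese character is seen as three separate bytes, none of which is a Han character. That is why \p{script=Han} matches nothing and \P{script=Han} matches everything.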
To say that your script is UTF-8 encoded, use the utf8 pragma. To tell perl that all files you open are UTF-8 encoded, use the -CD command line option. So the following one-liner should solve your problem:
perl -Mutf8 -CD -0777 -pi -e 's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$1$2$3<\/tag>/g' file
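One way to check this is to run the command against the two samples from the question. This is only a sketch: the scratch directory and the file names match.txt and nomatch.txt are made up for the test, and the replacement here writes $1$2$3 between the tags so that 一 and 二 are kept in the text, as in the desired output.

```shell
# Hypothetical scratch files holding the two samples from the question.
tmpdir=$(mktemp -d)
printf '一\n<lb/> 二\n'   > "$tmpdir/match.txt"
printf '一\n<lb/> 三二\n' > "$tmpdir/nomatch.txt"

# -Mutf8 decodes the literals in the script; -CD decodes and encodes the files.
perl -Mutf8 -CD -0777 -pi -e \
  's/(一)(\P{script=Han}*?)(二)/<tag ref="$1$3">$1$2$3<\/tag>/g' \
  "$tmpdir/match.txt" "$tmpdir/nomatch.txt"

matched=$(cat "$tmpdir/match.txt")      # should now be wrapped in <tag>
unmatched=$(cat "$tmpdir/nomatch.txt")  # should be unchanged (三 is Han)
printf '%s\n---\n%s\n' "$matched" "$unmatched"
rm -rf "$tmpdir"
```

The second file is left alone because 三 sits between 一 and 二, and \P{script=Han} refuses to cross it once the input is decoded.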