GNU grep regex `[一-十]` (one to ten) does not match the Chinese character 四 (four)


This command

$ echo '一二三四五六七八九十' | grep -oE '[一-十]'

outputs:

一
二
三
五
六
七
八
九
十

I expected the regex [一-十] (one to ten) to match every Chinese numeral from one to ten. As the example shows, it matches all of them except the character 四 (four).

Why?

Is this a bug or a joke?

I could take this as a joke, because in Chinese '四' (four) sounds like '事' (thing). In fact, in some dialects of Chinese they share the same pronunciation. Thus '一二三五六七八九十' (one two three five six seven eight nine ten) implies '沒四' (no four), i.e. '沒事' (no thing, "nothing is wrong").

BTW, the version of the grep I use:

GNU grep 2.5.4

Solution

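  • The Chinese numerals are not in numerical order in Unicode. 一 (one) is U+4E00 and 十 (ten) is U+5341, but 四 (four) is U+56DB, which lies outside the range U+4E00 to U+5341. So 四 is not matched by [一-十]; the other eight numerals happen to fall inside the range.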

    Read the Unicode standard for more information, and see http://www.unicode.org/charts/PDF/U4E00.pdf.
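
    To see this for yourself, here is a minimal sketch in Python rather than grep. It assumes that Python's re module compares the characters in a range by code point, which is consistent with the grep output in the question, though it is not a statement about how grep 2.5.4 is implemented. The variable names are only for illustration. It prints each numeral's code point and whether it falls between 一 and 十, then compares a range-based class with one that lists the ten characters explicitly.

```python
import re

numerals = "一二三四五六七八九十"

# Bounds of the range [一-十], as code points.
lo, hi = ord("一"), ord("十")

# Show each numeral's code point and whether it lies inside the range.
for ch in numerals:
    inside = lo <= ord(ch) <= hi
    print(f"{ch} U+{ord(ch):04X} {'in range' if inside else 'out of range'}")

# The range-based class misses 四; listing the characters explicitly does not.
range_matches = re.findall("[一-十]", numerals)
explicit_matches = re.findall("[一二三四五六七八九十]", numerals)

print("".join(range_matches))
print("".join(explicit_matches))
```

    The same workaround should carry over to grep: spell out the ten characters in the bracket expression instead of relying on a range.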