go unicode cjk punctuation chinese-locale

How to Check If a Rune Is a Chinese Punctuation Character in Go


How can I detect Chinese punctuation characters such as 。 or 〜 in Go?

I tried a range table from the unicode package, as in the code below, but unicode.Han doesn't include those punctuation characters.

Which range table should I use for this task? (Please refrain from using regex, because its performance is poor.)

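// hasHan reports whether strToDetect contains a Han character.
// It returns false for punctuation, which unicode.Han does not include.
func hasHan(strToDetect string) bool {
    for _, r := range strToDetect {
        if unicode.Is(unicode.Han, r) {
            return true
        }
    }
    return false
}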

Solution

  • Punctuation marks are scattered across different Unicode code blocks.


    The Unicode® Standard
    Version 14.0 – Core Specification

    Chapter 6
    Writing Systems and Punctuation
    https://www.unicode.org/versions/latest/ch06.pdf

    Punctuation. The rest of this chapter deals with a special case: punctuation marks, which tend to be scattered about in different blocks and which may be used in common by many scripts. Punctuation characters occur in several widely separated places in the blocks, including Basic Latin, Latin-1 Supplement, General Punctuation, Supplemental Punctuation, and CJK Symbols and Punctuation. There are also occasional punctuation characters in blocks for specific scripts.


    Here are two of your examples:

    〜 Wave Dash U+301C

    。 Ideographic Full Stop U+3002


    package main
    
    import (
        "fmt"
        "unicode"
    )
    
    func main() {
        // CJK Symbols and Punctuation Unicode block
        for r := rune('\u3000'); r <= '\u303F'; r++ {
            if unicode.IsPunct(r) {
                fmt.Printf("%[1]U\t%[1]c\n", r)
            }
        }
    }
    

    https://go.dev/play/p/WoJjM6JKTYR

    U+3001  、
    U+3002  。
    U+3003  〃
    U+3008  〈
    U+3009  〉
    U+300A  《
    U+300B  》
    U+300C  「
    U+300D  」
    U+300E  『
    U+300F  』
    U+3010  【
    U+3011  】
    U+3014  〔
    U+3015  〕
    U+3016  〖
    U+3017  〗
    U+3018  〘
    U+3019  〙
    U+301A  〚
    U+301B  〛
    U+301C  〜
    U+301D  〝
    U+301E  〞
    U+301F  〟
    U+3030  〰
    U+303D  〽
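
To turn the listing above into a reusable check, one option is to build a custom range table for the CJK Symbols and Punctuation block and combine it with unicode.IsPunct. This is only a minimal sketch: the names cjkSymbolsAndPunctuation, isCJKPunct and containsCJKPunct are made up for illustration and are not part of package unicode, and it assumes that block is the only one you care about.

```go
package main

import (
	"fmt"
	"unicode"
)

// cjkSymbolsAndPunctuation covers the CJK Symbols and Punctuation block
// (U+3000..U+303F). Illustrative name; package unicode has no such table.
var cjkSymbolsAndPunctuation = &unicode.RangeTable{
	R16: []unicode.Range16{
		{Lo: 0x3000, Hi: 0x303F, Stride: 1},
	},
}

// isCJKPunct reports whether r lies in the block above and is also
// classified as punctuation (Unicode category P) by unicode.IsPunct.
func isCJKPunct(r rune) bool {
	return unicode.Is(cjkSymbolsAndPunctuation, r) && unicode.IsPunct(r)
}

// containsCJKPunct reports whether s contains at least one such rune.
func containsCJKPunct(s string) bool {
	for _, r := range s {
		if isCJKPunct(r) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(containsCJKPunct("你好。"))    // true: 。 is U+3002
	fmt.Println(containsCJKPunct("你好"))     // false: Han characters only
	fmt.Println(containsCJKPunct("hello.")) // false: '.' is outside the block
}
```

Note that this covers a single block. Fullwidth marks such as ， (U+FF0C) live in the Halfwidth and Fullwidth Forms block (U+FF00..U+FFEF), so if your text uses them you would probably need to add another Range16 entry to the table.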