I want to use the Chinese BERT model. In tokenization.py, I found a WordpieceTokenizer class, but I don't think WordPiece is needed for Chinese, because the minimal unit of Chinese is the character.
Is WordpieceTokenizer just for English text, am I right?
From the README:
We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages.
However, from the Multilingual README (emphasis added):
Because Chinese (and Japanese Kanji and Korean Hanja) does not have whitespace characters, we add spaces around every character in the CJK Unicode range before applying WordPiece.
So WordPiece is presumably run on the whole sentence, though it only matters for sentences that contain non-Chinese characters. To run the code as-is, you would therefore still want WordPiece.
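To make the pipeline concrete, here is a minimal sketch of the CJK-spacing step the Multilingual README describes. It is illustrative only: the helper names (`_is_cjk_char`, `add_spaces_around_cjk`) and the Unicode ranges shown are assumptions for this example, not the exact code in tokenization.py.

```python
# Illustrative sketch of "add spaces around every character in the CJK
# Unicode range before applying WordPiece". Helper names and ranges are
# simplified assumptions, not the repo's actual implementation.

def _is_cjk_char(cp):
    """Rough check for the main CJK Unified Ideographs blocks (subset of the
    ranges the real tokenizer covers)."""
    return (0x4E00 <= cp <= 0x9FFF) or (0x3400 <= cp <= 0x4DBF)

def add_spaces_around_cjk(text):
    """Surround every CJK character with spaces so that a plain whitespace
    split yields one token per Chinese character."""
    out = []
    for ch in text:
        if _is_cjk_char(ord(ch)):
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return "".join(out)

print(add_spaces_around_cjk("BERT模型很好用").split())
# ['BERT', '模', '型', '很', '好', '用']
```

After this step, WordPiece runs on each whitespace-delimited token: Chinese characters pass through as single-character tokens (assuming they are in the vocabulary), while non-Chinese tokens like "BERT" may still be split into subwords, which is why WordPiece is still needed for mixed-language input.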
However, to clarify: