I want to use the Chinese BERT model. In tokenization.py, I found a WordpieceTokenizer class, but I don't think WordPiece is needed for Chinese, because the minimal unit of Chinese is the character.
Is WordpieceTokenizer just for English text, am I right?
From the README:
We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages.
However, from the Multilingual README (emphasis added):
Because Chinese (and Japanese Kanji and Korean Hanja) does not have whitespace characters, we add spaces around every character in the CJK Unicode range before applying WordPiece.
So WordPiece is presumably run on the whole sentence, though it only matters for sentences that contain non-Chinese characters. To run the code as-is, you would therefore still want WordPiece.
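To make the pipeline concrete, here is a minimal sketch of the CJK-spacing step the Multilingual README describes. It is illustrative only: the helper names (`_is_cjk_char`, `add_spaces_around_cjk`) and the Unicode ranges shown are assumptions for this example, not the exact code in tokenization.py.

```python
# Illustrative sketch of "add spaces around every character in the CJK
# Unicode range before applying WordPiece". Helper names and ranges are
# simplified assumptions, not the repo's actual implementation.

def _is_cjk_char(cp):
    """Rough check for the main CJK Unified Ideographs blocks (subset of the
    ranges the real tokenizer covers)."""
    return (0x4E00 <= cp <= 0x9FFF) or (0x3400 <= cp <= 0x4DBF)

def add_spaces_around_cjk(text):
    """Surround every CJK character with spaces so that a plain whitespace
    split yields one token per Chinese character."""
    out = []
    for ch in text:
        if _is_cjk_char(ord(ch)):
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return "".join(out)

print(add_spaces_around_cjk("BERT模型很好用").split())
# ['BERT', '模', '型', '很', '好', '用']
```

After this step, WordPiece runs on each whitespace-delimited token: Chinese characters pass through as single-character tokens (assuming they are in the vocabulary), while non-Chinese tokens like "BERT" may still be split into subwords, which is why WordPiece is still needed for mixed-language input.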
However, to clarify: