I was looking at the vocabulary of GPT-2.
https://huggingface.co/gpt2/blob/main/vocab.json
I found to my surprise very weird tokens that I did not expect. For example, it contains the token (index 35496): ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ
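For reference, this is roughly how the entry can be looked up programmatically (a minimal sketch; it assumes vocab.json has been downloaded locally from the link above):

    import json

    # vocab.json maps token string -> token id; invert it to look up by id.
    with open("vocab.json", encoding="utf-8") as f:
        vocab = json.load(f)

    id_to_token = {i: t for t, i in vocab.items()}
    token = id_to_token[35496]
    print(repr(token), len(token))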
How did this happen? Is this token common in the GPT-2 training data? In general, how was the vocabulary for GPT-2 built, and is there a problem here?
Information about the model is available here: https://huggingface.co/gpt2
The OpenAI team wanted to train this model on a corpus as large as possible. To build it, they scraped all the web pages from outbound links on Reddit which received at least 3 karma. Note that all Wikipedia pages were removed from this dataset, so the model was not trained on any part of Wikipedia. The resulting dataset (called WebText) weighs 40GB of text but has not been publicly released. You can find a list of the top 1,000 domains present in WebText here.
According to the Hugging Face GPT2Tokenizer documentation, the tokenizer is based on byte-level BPE (byte-pair encoding), and such a token could have ended up in the vocabulary due to an encoding issue in the training data.
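As a sanity check, the same entry can be pulled straight from the tokenizer (a rough sketch; it assumes the transformers library is installed and will download the gpt2 tokenizer files):

    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")

    # Token string exactly as stored in the BPE vocabulary.
    print(tok.convert_ids_to_tokens(35496))

    # Decoding maps the byte-level symbols back to raw bytes and then to text;
    # it may produce replacement characters if those bytes are not valid UTF-8.
    print(tok.decode([35496]))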
You can see that the character codes for Ã and Â are 195 and 194, i.e. the bytes 0xC3 0xC2. That byte pair is not valid UTF-8 on its own, so it could be a two-byte character in some other encoding, the kind of garbled text produced when data is decoded with the wrong encoding, or part of binary data that leaked into the corpus.
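A quick check along those lines (a small sketch; the byte values 0xC3 0xC2 are the ones noted above): the pair is not valid UTF-8, but it decodes fine as Latin-1:

    print([hex(ord(c)) for c in "ÃÂ"])    # ['0xc3', '0xc2']

    raw = b"\xc3\xc2" * 4
    print(raw.decode("latin-1"))          # 'ÃÂÃÂÃÂÃÂ' when read as Latin-1

    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        # 0xC3 expects a continuation byte (0x80-0xBF); 0xC2 is not one.
        print("not valid UTF-8:", err)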
If that token was not frequent in the training data, it is likely that it will never be relevant in the output. But it is an issue in the sense that the model wastes some of its capacity modeling the behavior of that token.