I am using the DistilBertTokenizer from HuggingFace.
I would like to tokenize my text by simply splitting it on spaces:
["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]
instead of the default behavior, which is like this:
["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]
I read their documentation about tokenization in general as well as about the BERT tokenizer specifically, but could not find an answer to this simple question :(
I assume it should be a parameter when loading the tokenizer, but I could not find it in the parameter list ...
EDIT: Minimal code example to reproduce:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
tokens = tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
print("Tokens: ", tokens)
That is not how it works. The transformers library provides different tokenizer classes, and in the case of DistilBERT it is a WordPiece tokenizer with a fixed vocabulary that was used to train the corresponding model, so it does not offer such a modification (as far as I know). Something you can do is use the split() method of the Python string:
text = "Don't you love 🤗 Transformers? We sure do."
tokens = text.split()
print("Tokens: ", tokens)
Output:
Tokens: ["Don't", 'you', 'love', '🤗', 'Transformers?', 'We', 'sure', 'do.']
In case you are looking for slightly more sophisticated tokenization that also takes punctuation into account, you can use the basic_tokenizer:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
tokens = tokenizer.basic_tokenizer.tokenize(text)
print("Tokens: ", tokens)
Output:
Tokens: ['Don', "'", 't', 'you', 'love', '🤗', 'Transformers', '?', 'We', 'sure', 'do', '.']
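For completeness: the slow BERT-style tokenizers essentially run this basic_tokenizer first and then apply WordPiece to each resulting token, which is what produces the default subword output. A minimal sketch of that two-stage behaviour, again assuming distilbert-base-cased (the exact subword pieces depend on its vocabulary):
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
text = "Don't you love 🤗 Transformers? We sure do."
# Stage 1: basic tokenization (whitespace + punctuation splitting),
# Stage 2: WordPiece on each resulting token.
pieces = []
for word in tokenizer.basic_tokenizer.tokenize(text):
    pieces.extend(tokenizer.wordpiece_tokenizer.tokenize(word))
print("Tokens: ", pieces)
# Tokens that are not in the vocabulary (such as the emoji) come back as the
# unknown token '[UNK]' or get split into '##'-prefixed pieces.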