Tags: split, tokenize, huggingface-transformers, huggingface-tokenizers

In HuggingFace tokenizers: how can I split a sequence simply on spaces?


I am using the DistilBertTokenizer from HuggingFace.

I would like to tokenize my text by simply splitting it on spaces:

["Don't", "you", "love", "🤗", "Transformers?", "We", "sure", "do."]

instead of the default behavior, which is like this:

["Do", "n't", "you", "love", "🤗", "Transformers", "?", "We", "sure", "do", "."]

I read their documentation about tokenization in general, as well as about the BERT tokenizer specifically, but could not find an answer to this simple question :(

I assume it should be a parameter when loading the tokenizer, but I could not find it in the parameter list ...

EDIT: Minimal code example to reproduce:

from transformers import DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

tokens = tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
print("Tokens: ", tokens)

Solution

  • That is not how it works. The transformers library provides different types of tokenizers. In the case of DistilBERT it is a WordPiece tokenizer with a fixed vocabulary that was used to train the corresponding model, and therefore it does not offer this kind of modification (as far as I know). What you can do instead is use the split() method of the Python string:

    text = "Don't you love 🤗 Transformers? We sure do."
    tokens = text.split()
    print("Tokens: ", tokens)
    

    Output:

    Tokens:  ["Don't", 'you', 'love', '🤗', 'Transformers?', 'We', 'sure', 'do.']
    

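    As a side note: if your end goal is model inputs rather than just a list of words, you can hand the whitespace-split words back to the tokenizer with is_split_into_words=True (assuming a reasonably recent transformers version; older releases used a different name for this argument). The tokenizer still applies its own punctuation and WordPiece splitting inside each word, but it will not merge anything across the word boundaries you provide. A minimal sketch:

    from transformers import DistilBertTokenizer

    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')

    # Whitespace-split words, exactly as produced by str.split() above
    words = "Don't you love 🤗 Transformers? We sure do.".split()

    # Each list element is treated as one word; WordPiece still runs inside
    # each word, but the given word boundaries are preserved.
    encoding = tokenizer(words, is_split_into_words=True)
    print(encoding["input_ids"])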
    In case you are looking for slightly more sophisticated tokenization that also takes punctuation into account, you can use the tokenizer's basic_tokenizer:

    from transformers import DistilBertTokenizer

    # Same example sentence as above
    text = "Don't you love 🤗 Transformers? We sure do."

    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
    tokens = tokenizer.basic_tokenizer.tokenize(text)
    print("Tokens: ", tokens)
    

    Output:

    Tokens:  ['Don', "'", 't', 'you', 'love', '🤗', 'Transformers', '?', 'We', 'sure', 'do', '.']
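
    For completeness: the slow BERT-style tokenizer builds its full output by running this basic_tokenizer first and then applying a WordPiece pass to each resulting word against its fixed vocabulary. The sketch below chains the two steps by hand; it assumes the tokenizer exposes a wordpiece_tokenizer attribute alongside basic_tokenizer (as the BERT-style slow tokenizers do) and should reproduce tokenizer.tokenize(text) for ordinary sentences:

    from transformers import DistilBertTokenizer

    tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
    text = "Don't you love 🤗 Transformers? We sure do."

    # Step 1: basic tokenization (whitespace + punctuation splitting)
    words = tokenizer.basic_tokenizer.tokenize(text)

    # Step 2: WordPiece splits each word against the fixed vocabulary;
    # anything not covered becomes subword pieces or [UNK]
    subwords = [piece for word in words
                for piece in tokenizer.wordpiece_tokenizer.tokenize(word)]

    print("Tokens: ", subwords)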