
Train Tokenizer with HuggingFace dataset


I'm trying to train a tokenizer on the HuggingFace wiki_split dataset. According to the Tokenizers documentation on GitHub, I can train a tokenizer with the following code:

from tokenizers import Tokenizer
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

# You can customize how pre-tokenization (e.g., splitting into words) is done:
from tokenizers.pre_tokenizers import Whitespace
tokenizer.pre_tokenizer = Whitespace()

# Then training your tokenizer on a set of files just takes two lines of code:
from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.train.raw", "wiki.valid.raw", "wiki.test.raw"], trainer=trainer)

# Once your tokenizer is trained, encode any text with just one line:
output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
print(output.tokens)
# ["Hello", ",", "y", "'", "all", "!", "How", "are", "you", "[UNK]", "?"]

However, that example loads from three files: wiki.train.raw, wiki.valid.raw and wiki.test.raw. In my case, I am loading from the wiki_split dataset. My code is as follows:

from tokenizers.trainers import BpeTrainer

def iterator_wiki(dataset):
    for txt in dataset:
        if type(txt) != float:
            yield txt

trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(iterator_wiki(wiki_train), trainer=trainer)

tokenizer.train_from_iterator() only accepts one iterator, so I can only pass it a single dataset split. How can I also use the validation and test splits here?


Solution

  • Use an iterator that iterates over all three datasets, one after another.

    Also note that each element in the wiki_split dataset is a dictionary. The first element of the train dataset is shown below:

    {'complex_sentence': "'' New Day '' is a song by American hip hop recording artist 50 Cent , released on July 27 , 2012 , as an promotional single from his upcoming fifth studio album '' Street King Immortal '' ( 2013 ) .",
     'simple_sentence_1': "'' New Day '' is a song by American hip hop recording artist 50 Cent . ",
     'simple_sentence_2': " The song was released on July 27 , 2012 , as a single from his upcoming fifth studio album '' Street King Immortal '' ( 2013 ) ."}
    

    Working Example

    # Load the datasets
    from datasets import load_dataset
    train_dataset = load_dataset('wiki_split', split='train')
    test_dataset = load_dataset('wiki_split', split='test')
    val_dataset = load_dataset('wiki_split', split='validation')
    
    # Iterator that yields the text from complex_sentence
    def iterator_wiki(train_dataset, test_dataset, val_dataset):
      for mydataset in [train_dataset, test_dataset, val_dataset]:
        for data in mydataset:
          if isinstance(data.get("complex_sentence", None), str):
            yield data["complex_sentence"]
    
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.trainers import BpeTrainer
    tokenizer = Tokenizer(BPE())
    
    from tokenizers.pre_tokenizers import Whitespace
    tokenizer.pre_tokenizer = Whitespace()
    
    trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
    tokenizer.train_from_iterator(iterator_wiki(
        train_dataset, test_dataset, val_dataset), trainer=trainer)
    
    output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
    print(output.tokens)
    

    Output:

    ['Hello', ',', 'y', "'", 'all', '!', 'How', 'are', 'you', '😁', '?']
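
If you would rather not hard-code the three splits, the same idea can be written more generally with itertools.chain. The following is only a sketch: iter_text and its columns parameter are hypothetical names (not part of tokenizers or datasets), and the three small lists are made-up stand-ins for the real splits so the snippet runs without downloading anything. It assumes each record is a dictionary, as shown above.

```python
from itertools import chain


def iter_text(*splits, columns=("complex_sentence",)):
    """Yield every non-empty string found in `columns`, one split after another."""
    for record in chain(*splits):
        for column in columns:
            text = record.get(column)
            # Skip missing or non-string values (e.g. None or NaN floats).
            if isinstance(text, str) and text.strip():
                yield text


# Made-up records that mimic the wiki_split layout (illustration only).
train_split = [{"complex_sentence": "A and B .",
                "simple_sentence_1": "A .",
                "simple_sentence_2": "B ."}]
val_split = [{"complex_sentence": None,
              "simple_sentence_1": "C .",
              "simple_sentence_2": "D ."}]
test_split = [{"complex_sentence": "E and F .",
               "simple_sentence_1": "E .",
               "simple_sentence_2": "F ."}]

# Default: only complex_sentence; the None value in val_split is skipped.
texts = list(iter_text(train_split, val_split, test_split))
print(texts)

# Optionally pull text from all three columns.
all_columns = ("complex_sentence", "simple_sentence_1", "simple_sentence_2")
print(len(list(iter_text(train_split, val_split, test_split, columns=all_columns))))
```

With the real data, the generator should be usable in place of iterator_wiki, for example tokenizer.train_from_iterator(iter_text(train_dataset, test_dataset, val_dataset), trainer=trainer). Whether training on the simple_sentence columns as well is worthwhile depends on your use case.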