python, tensorflow, nlp, tokenize, data-preprocessing

Tokenizing very large text datasets (cannot fit in RAM/GPU memory) with TensorFlow


How do we tokenize very large text datasets that don't fit into memory in TensorFlow? For image datasets there is the ImageDataGenerator, which loads and preprocesses the data batch by batch as it is fed to the model. For text datasets, however, tokenization is usually performed before training the model. Can the dataset be split into batches for the tokenizer, or does TensorFlow already provide a batch tokenizer function? Can this be done without importing external libraries?

I know that there are external libraries that do this, for example: https://github.com/huggingface/transformers/issues/3851


Solution

  • In your case you need to define your own data-processing pipeline using the tf.data module. With this module you can build a custom tf.data.Dataset, and such datasets support features like parsing records into a specific format (using the map function) and batching; a minimal sketch follows below.

    Here is a complete example of how you could use the tf.data module for building your own pipeline: https://www.tensorflow.org/guide/data
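    For instance, a pipeline could look like the sketch below: it streams a text file from disk line by line and tokenizes each batch on the fly inside map(). The file name corpus.txt, the use of a Keras TextVectorization layer as the tokenizer, and its settings are illustrative assumptions, not requirements.

    ```python
    # A minimal sketch of a streaming tokenization pipeline with tf.data.
    # Assumes a plain-text corpus at "corpus.txt" with one example per line
    # (hypothetical file) and a Keras TextVectorization layer as the tokenizer.
    import tensorflow as tf

    # Stream the file from disk; nothing is loaded into memory up front.
    lines = tf.data.TextLineDataset("corpus.txt")

    # Build the vocabulary by streaming over the data once (adapt accepts a
    # tf.data.Dataset, so this also avoids loading everything into RAM).
    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=20_000, output_sequence_length=128
    )
    vectorizer.adapt(lines.batch(1024))

    # Batch first, then tokenize each batch of raw strings on the fly.
    dataset = (
        lines
        .batch(32)
        .map(lambda batch: vectorizer(batch), num_parallel_calls=tf.data.AUTOTUNE)
        .prefetch(tf.data.AUTOTUNE)
    )
    ```

    Note that this dataset yields only token IDs; for supervised training you would still need to pair each batch with labels, e.g. by parsing them from the same lines inside map() or by zipping with a second dataset, before passing the result to model.fit.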