amazon-sagemaker, huggingface-transformers, huggingface-tokenizers, amz-sagemaker-distributed-training

Create Hugging Face Transformers Tokenizer using Amazon SageMaker in a distributed way


I am using the SageMaker HuggingFace Processor to create a custom tokenizer on a large volume of text data. Is there a way to make this job data-distributed, i.e., read partitions of the data across nodes and train the tokenizer leveraging multiple CPUs/GPUs?

At the moment, adding more nodes to the processing cluster merely replicates the tokenization job (each instance repeats the same tokenizer-creation process), which is redundant; effectively, you can only scale vertically.

Any insights into this?


Solution

  • Consider the following approach with the HuggingFaceProcessor (a sketch of the setup is shown below):

    If you have 100 large files in S3 and use a ProcessingInput with s3_data_distribution_type="ShardedByS3Key" (instead of FullyReplicated), the objects in your S3 prefix will be sharded and distributed to your instances.

    For example, if you have 100 large files and want to filter records from them using HuggingFace on 5 instances, s3_data_distribution_type="ShardedByS3Key" will place 20 objects on each instance. Each instance can then read the files from its own local path, filter out records, and write uniquely named files to its output path, and SageMaker Processing will upload the filtered files to S3.
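
    As a rough sketch of that setup (the script name, S3 URIs, framework versions, and instance type below are placeholders rather than values from the question):

        from sagemaker.huggingface import HuggingFaceProcessor
        from sagemaker.processing import ProcessingInput, ProcessingOutput

        # 5 instances: with ShardedByS3Key, each one receives roughly 1/5 of the
        # objects under the input prefix instead of a full copy of the data
        processor = HuggingFaceProcessor(
            role="<your-sagemaker-execution-role>",
            instance_count=5,
            instance_type="ml.c5.2xlarge",
            transformers_version="4.28",  # pick a version combination supported in your region
            pytorch_version="2.0",
            py_version="py310",
        )

        processor.run(
            code="process_tokenizer.py",  # hypothetical script that reads /opt/ml/processing/input
            inputs=[
                ProcessingInput(
                    source="s3://my-bucket/raw-text/",           # placeholder prefix holding the files
                    destination="/opt/ml/processing/input",
                    s3_data_distribution_type="ShardedByS3Key",  # shard objects across instances
                )
            ],
            outputs=[
                ProcessingOutput(
                    source="/opt/ml/processing/output",
                    destination="s3://my-bucket/processed-text/",  # placeholder output prefix
                )
            ],
        )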

    However, if your filtering criteria are stateful or depend on a full pass over the dataset first (for example, filtering outliers based on the mean and standard deviation of a feature when using the SKLearnProcessor), you'll need to pass that information into the job so each instance knows how to filter. To let each instance know its place in the cluster, you can use the /opt/ml/config/resourceconfig.json file that SageMaker writes on every instance:

    { "current_host": "algo-1", "hosts": ["algo-1","algo-2","algo-3"] }