amazon-s3, huggingface-transformers

Reading a pretrained huggingface transformer directly from S3


Loading a huggingface pretrained transformer model seemingly requires you to have the model saved locally (as described here), so that you simply pass a local path to your model and config:

model = PreTrainedModel.from_pretrained('path/to/model', local_files_only=True)

Can this be achieved when the model is stored on S3?


Solution

  • Answering my own question... (apparently encouraged)

    I achieved this using a transient file (NamedTemporaryFile), which does the trick. I was hoping to find an in-memory solution (i.e. passing the BytesIO directly to from_pretrained), but that would require a patch to the transformers codebase.

    import boto3 
    import json 
    
    from contextlib import contextmanager 
    from io import BytesIO 
    from tempfile import NamedTemporaryFile 
    from transformers import PretrainedConfig, PreTrainedModel 
      
    @contextmanager 
    def s3_fileobj(bucket, key): 
        """
        Yields a file object from the filename at {bucket}/{key}
    
        Args:
            bucket (str): Name of the S3 bucket where your model is stored
            key (str): Relative path from the base of your bucket, including the filename and extension of the object to be retrieved.
        """
        s3 = boto3.client("s3") 
        obj = s3.get_object(Bucket=bucket, Key=key) 
        yield BytesIO(obj["Body"].read()) 
     
    def load_model(bucket, path_to_model, model_name='pytorch_model'):
        """
        Load a model at the given S3 path. It is assumed that your model is stored at the key:
    
            '{path_to_model}/{model_name}.bin'
    
        and that a config has also been generated at the same path named:
    
            '{path_to_model}/config.json'
    
        """
        tempfile = NamedTemporaryFile()
        with s3_fileobj(bucket, f'{path_to_model}/{model_name}.bin') as f:
            tempfile.write(f.read())
        tempfile.flush()  # flush buffered bytes so from_pretrained sees the complete file
     
        with s3_fileobj(bucket, f'{path_to_model}/config.json') as f: 
            dict_data = json.load(f) 
            config = PretrainedConfig.from_dict(dict_data) 
     
        # Note: in practice, call from_pretrained on a concrete architecture class
        # (e.g. BertModel or AutoModel) rather than the PreTrainedModel base class,
        # which defines no architecture of its own.
        model = PreTrainedModel.from_pretrained(tempfile.name, config=config)
        return model 
         
    model = load_model('my_bucket', 'path/to/model')
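    For the in-memory route mentioned above, note that torch.load itself accepts any file-like object, so weights fetched from S3 as a BytesIO never need to touch disk; only from_pretrained insists on a path. A minimal sketch of that idea, using a tiny nn.Linear as a stand-in for a real transformer (the buffer here simulates the bytes that s3_fileobj would yield):

    ```python
    from io import BytesIO

    import torch
    from torch import nn

    # Simulate the bytes you would get from s3.get_object(...)["Body"].read();
    # in the real case, s3_fileobj(bucket, key) yields this BytesIO for you.
    buffer = BytesIO()
    torch.save(nn.Linear(4, 2).state_dict(), buffer)
    buffer.seek(0)

    # Load the state dict entirely in memory and apply it to a freshly built model
    state_dict = torch.load(buffer, map_location='cpu')
    model = nn.Linear(4, 2)
    model.load_state_dict(state_dict)
    ```

    With a transformer, you would build the model from the config (e.g. BertModel(config)) and then call model.load_state_dict(state_dict) the same way, sidestepping the temp file at the cost of duplicating what from_pretrained does internally.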