Tags: python, palantir-foundry, foundry-code-repositories

Pass a whole dataset containing multiple files to a HuggingFace function in Palantir


I am using a pre-trained model from HuggingFace (dslim/bert-base-NER). Normally, when working locally or in a Colab notebook, we can use the code below to load the model:

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

This code connects to the HuggingFace server and downloads the model directly. On Palantir, however, outbound network access is blocked unless the appropriate network egress policies are configured.

The workaround here is to download all of the model files and upload them to a Foundry dataset. We can then pass a folder containing these files to the HuggingFace from_pretrained function (check this answer), like below:

tokenizer = AutoTokenizer.from_pretrained('./local_model_directory/')
model = AutoModelForTokenClassification.from_pretrained('./local_model_directory/')
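
As an aside, one way to fetch all of the required files locally before uploading them to the dataset is huggingface_hub's snapshot_download. This is only a sketch; the local_dir argument assumes a reasonably recent version of huggingface_hub:

from huggingface_hub import snapshot_download

# Download every file in the model repo (config.json, model weights,
# tokenizer files, etc.) into a local folder, ready to upload to Foundry
snapshot_download(repo_id="dslim/bert-base-NER", local_dir="./local_model_directory")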

For a model that consists of only a pickle file, we can easily read that file from a dataset called classifier:

import pickle

from transforms.api import transform, Input


@transform(
    file_input=Input("/Users/model/classifier")
)
def load_model(file_input):
    # Open the pickle file from the dataset's filesystem and deserialize it
    with file_input.filesystem().open("classifier.pkl", "rb") as f:
        model = pickle.load(f)
    return model

My question is: how can we open a dataset and pass a whole directory to this function in a Palantir code repository?


Solution

  • Update:

palantir_models now publishes the below utility as palantir_models.transforms.copy_model_to_driver, which can be used in place of copy_to_temp_directory.

    from palantir_models.transforms import ModelOutput, copy_model_to_driver
    from transforms.api import transform, Input
    from transformers import AutoModel, AutoTokenizer

    @transform(
        file_input=Input("/Users/model/hugging_face_files"),
        model_output=ModelOutput("/Users/model/model_asset")
    )
    def load_model(file_input, model_output):
        # Copy the dataset files onto the driver's local disk so that
        # from_pretrained can read them like an ordinary directory
        temp_dir = copy_model_to_driver(file_input.filesystem())
        tokenizer = AutoTokenizer.from_pretrained(temp_dir)
        model = AutoModel.from_pretrained(temp_dir)

        # Wrap the tokenizer and model in a model adapter and publish it
        foundry_model = HuggingFaceModelAdapter(tokenizer, model)
        model_output.publish(model_adapter=foundry_model)
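
    Note that HuggingFaceModelAdapter is not something palantir_models provides; it is an adapter class you write yourself. Below is a minimal sketch of what it might look like, following the palantir_models ModelAdapter pattern. The serializer choice and the exact api/run_inference details here are assumptions, so check the palantir_models documentation for your version:

    import palantir_models as pm
    from palantir_models_serializers import DillSerializer

    class HuggingFaceModelAdapter(pm.ModelAdapter):
        @pm.auto_serialize(
            tokenizer=DillSerializer(),
            model=DillSerializer(),
        )
        def __init__(self, tokenizer, model):
            self.tokenizer = tokenizer
            self.model = model

        @classmethod
        def api(cls):
            inputs = {"df_in": pm.Pandas(columns=[("text", str)])}
            outputs = {"df_out": pm.Pandas(columns=[("text", str), ("entities", str)])}
            return inputs, outputs

        def run_inference(self, inputs, outputs):
            df = inputs.df_in.copy()
            # Hypothetical inference step: tokenize each row and run the model
            df["entities"] = df["text"].apply(
                lambda text: str(self.model(**self.tokenizer(text, return_tensors="pt")))
            )
            outputs.df_out.write(df)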
    

    Original answer:

    You can copy the files into a temporary directory with the util copy_to_temp_directory below. I've also provided an example usage.

    from palantir_models.transforms import ModelOutput
    from transforms.api import transform, Input
    from transformers import AutoModel, AutoTokenizer

    @transform(
        file_input=Input("/Users/model/hugging_face_files"),
        model_output=ModelOutput("/Users/model/model_asset")
    )
    def load_model(file_input, model_output):
        # Copy the dataset files into a temporary directory on the driver
        temp_dir = copy_to_temp_directory(file_input.filesystem())
        tokenizer = AutoTokenizer.from_pretrained(temp_dir)
        model = AutoModel.from_pretrained(temp_dir)

        # Wrap the tokenizer and model in a model adapter and publish it
        foundry_model = HuggingFaceModelAdapter(tokenizer, model)
        model_output.publish(model_adapter=foundry_model)
    
    
    def copy_to_temp_directory(foundry_filesystem):
        import os
        import shutil
        import tempfile

        temp_dir = tempfile.mkdtemp()
        for raw_file in foundry_filesystem.ls():
            output_path = os.path.join(temp_dir, raw_file.path)
            # Create any intermediate directories, in case the dataset
            # contains files nested inside subfolders
            os.makedirs(os.path.dirname(output_path), exist_ok=True)
            with foundry_filesystem.open(raw_file.path, "rb") as remote:
                with open(output_path, "wb") as local:
                    shutil.copyfileobj(remote, local)

        return temp_dir
    

    If your HuggingFace model is large, you may need to specify a larger Spark profile, such as DRIVER_MEMORY_MEDIUM or DRIVER_MEMORY_LARGE, to ensure the Spark driver has enough memory to load the model files. For example:
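
    A minimal sketch of applying such a profile with the configure decorator from transforms.api (this assumes the profile has been imported and enabled in your repository's settings):

    from transforms.api import configure, transform, Input
    from palantir_models.transforms import ModelOutput

    @configure(profile=["DRIVER_MEMORY_LARGE"])
    @transform(
        file_input=Input("/Users/model/hugging_face_files"),
        model_output=ModelOutput("/Users/model/model_asset")
    )
    def load_model(file_input, model_output):
        ...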

    Hopefully this helps!