I am using the pre-trained model dslim/bert-base-NER from Hugging Face. Normally, when working locally or in a Colab notebook, we can use the code below to load the model:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
This code connects to the Hugging Face server to download the model directly. But on Palantir, this does not work unless the network egress security settings are configured.
The workaround here is to download all the files and upload them to a dataset. We can then pass a folder that contains these files to the Hugging Face function (check this answer), like below:
tokenizer = AutoTokenizer.from_pretrained('./local_model_directory/')
model = AutoModelForTokenClassification.from_pretrained('./local_model_directory/')
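For the download step itself, one option is huggingface_hub's snapshot_download, which fetches every file in the repo into a local folder that you can then upload to the dataset. A minimal sketch, assuming you have the huggingface_hub package and internet access on your local machine:

from huggingface_hub import snapshot_download

# Download all repo files (config.json, tokenizer files, model weights, ...)
# into a local directory, then upload that directory's contents to a
# Foundry dataset.
snapshot_download(repo_id="dslim/bert-base-NER", local_dir="./local_model_directory")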
For a model that has only a pickle file, we can easily read that file via a dataset called classifier:
import pickle
from transforms.api import transform, Input

@transform(
    file_input=Input("/Users/model/classifier")
)
def load_model(file_input):
    with file_input.filesystem().open("classifier.pkl", "rb") as f:
        model = pickle.load(f)
    return model
My question is: how can we open a dataset and pass a whole directory to this function in a Palantir code repository?
Update:
palantir_models now ships the below code as palantir_models.transforms.copy_model_to_driver, which can be used in place of copy_to_temp_directory.
from palantir_models.transforms import ModelOutput, copy_model_to_driver
from transforms.api import transform, Input
from transformers import AutoModel, AutoTokenizer

@transform(
    file_input=Input("/Users/model/hugging_face_files"),
    model_output=ModelOutput("/Users/model/model_asset")
)
def load_model(file_input, model_output):
    # Copy the dataset's files onto the driver's local disk
    temp_dir = copy_model_to_driver(file_input.filesystem())
    tokenizer = AutoTokenizer.from_pretrained(temp_dir)
    model = AutoModel.from_pretrained(temp_dir)
    # Wrap files in a model adapter (HuggingFaceModelAdapter is your own
    # palantir_models ModelAdapter subclass)
    foundry_model = HuggingFaceModelAdapter(tokenizer, model)
    model_output.publish(model_adapter=foundry_model)
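Since dslim/bert-base-NER is a token-classification checkpoint, you would load it with AutoModelForTokenClassification rather than plain AutoModel; a quick sanity check on the driver before publishing might look like the sketch below (assuming the model files are in temp_dir):

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained(temp_dir)
model = AutoModelForTokenClassification.from_pretrained(temp_dir)

# Run one example through a NER pipeline to confirm the files loaded correctly
ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
print(ner("My name is Wolfgang and I live in Berlin"))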
Original answer:
You can copy the files into a temporary directory with the below util copy_to_temp_directory. I've also provided an example usage.
from palantir_models.transforms import ModelOutput
from transforms.api import transform, Input
from transformers import AutoModel, AutoTokenizer

@transform(
    file_input=Input("/Users/model/hugging_face_files"),
    model_output=ModelOutput("/Users/model/model_asset")
)
def load_model(file_input, model_output):
    temp_dir = copy_to_temp_directory(file_input.filesystem())
    tokenizer = AutoTokenizer.from_pretrained(temp_dir)
    model = AutoModel.from_pretrained(temp_dir)
    # Wrap files in a model adapter
    foundry_model = HuggingFaceModelAdapter(tokenizer, model)
    model_output.publish(model_adapter=foundry_model)
def copy_to_temp_directory(foundry_filesystem):
    import os
    import shutil
    import tempfile

    # Create a scratch directory on the driver and copy every file from
    # the Foundry dataset's filesystem into it.
    temp_dir = tempfile.mkdtemp()
    for raw_file in foundry_filesystem.ls():
        output_path = os.path.join(temp_dir, raw_file.path)
        with foundry_filesystem.open(raw_file.path, "rb") as remote:
            with open(output_path, "wb") as local:
                shutil.copyfileobj(remote, local)
    return temp_dir
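One caveat: if your dataset happens to contain nested paths (for example onnx/model.onnx), the open() call above will fail because the parent directory doesn't exist locally yet. A hypothetical tweak, inserted just before the open() calls:

# Assumption: raw_file.path may contain subdirectories; create the
# parent directory locally before writing the file.
os.makedirs(os.path.dirname(output_path), exist_ok=True)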
If your Hugging Face model is large, you may need to specify a larger Spark profile like DRIVER_MEMORY_MEDIUM or DRIVER_MEMORY_LARGE to ensure the Spark driver has enough memory for the model files to load correctly.
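For reference, a sketch of applying such a profile with the transforms configure decorator (the profile must also be imported in your repository's settings):

from transforms.api import configure, transform, Input

@configure(profile=["DRIVER_MEMORY_LARGE"])
@transform(
    file_input=Input("/Users/model/hugging_face_files"),
    model_output=ModelOutput("/Users/model/model_asset")
)
def load_model(file_input, model_output):
    ...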
Hopefully this helps!