Let's say that I want to include distilbert
https://huggingface.co/distilbert-base-uncased from Hugging Face into spaCy 3.0 pipeline. I think that this is possible and I found some code on how to convert this model for spaCy 2.0 but it doesn't work in v3.0. What I really want is to load this model using something like this
nlp = spacy.load('path_to_distilbert')
Is it even possible and could you please provide the exact steps to do that.
You can use spacy-transformers
to this end. In spaCy v3, you can train custom pipelines using a config file, where you would define the transformer
component using any HF model you like in components.transformer.model.name
:
[components.transformer]
factory = "transformer"
max_batch_items = 4096
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-cased"
tokenizer_config = {"use_fast": true}
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.doc_spans.v1"
[components.transformer.set_extra_annotations]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1"
You can then train any other component (NER, textcat, ...) on top of this pretrained transformer model, and the transformer weights will be further finetuned, too.
You can read more about this in the docs here: https://spacy.io/usage/embeddings-transformers#transformers-training