I am having a hard time figuring out how to assemble spaCy pipelines bit by bit from built-in models in spaCy v3. I have downloaded the en_core_web_sm model and can load it with nlp = spacy.load("en_core_web_sm"). Processing of sample text works just fine like this.
Now what I want is to build an English pipeline from a blank model and add components bit by bit. I do NOT want to load the entire en_core_web_sm pipeline and exclude components. For the sake of concreteness, let's say I only want spaCy's default tagger in the pipeline. The documentation suggests to me that
import spacy
from spacy.pipeline.tagger import DEFAULT_TAGGER_MODEL
config = {"model": DEFAULT_TAGGER_MODEL}
nlp = spacy.blank("en")
nlp.add_pipe("tagger", config=config)
nlp("This is some sample text.")
should work. However, I am getting this error related to hashembed:
Traceback (most recent call last):
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1000, in __call__
doc = proc(doc, **component_cfg.get(name, {}))
File "spacy/pipeline/trainable_pipe.pyx", line 56, in spacy.pipeline.trainable_pipe.TrainablePipe.__call__
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/util.py", line 1507, in raise_error
raise e
File "spacy/pipeline/trainable_pipe.pyx", line 52, in spacy.pipeline.trainable_pipe.TrainablePipe.__call__
File "spacy/pipeline/tagger.pyx", line 111, in spacy.pipeline.tagger.Tagger.predict
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 315, in predict
return self._func(self, X, is_train=False)[0]
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/with_array.py", line 30, in forward
return _ragged_forward(
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/with_array.py", line 90, in _ragged_forward
Y, get_dX = layer(Xr.dataXd, is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/concatenate.py", line 44, in forward
Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/concatenate.py", line 44, in <listcomp>
Ys, callbacks = zip(*[layer(X, is_train=is_train) for layer in model.layers])
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/chain.py", line 54, in forward
Y, inc_layer_grad = layer(X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 291, in __call__
return self._func(self, X, is_train=is_train)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/layers/hashembed.py", line 61, in forward
vectors = cast(Floats2d, model.get_param("E"))
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/model.py", line 216, in get_param
raise KeyError(
KeyError: "Parameter 'E' for model 'hashembed' has not been allocated yet."
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-8e2b4cf9fd33>", line 8, in <module>
nlp("This is some sample text.")
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1003, in __call__
raise ValueError(Errors.E109.format(name=name)) from e
ValueError: [E109] Component 'tagger' could not be run. Did you forget to call `initialize()`?
hinting that I should run initialize(). OK. If I then run nlp.initialize(), I finally get this error:
Traceback (most recent call last):
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-3-eeec225a68df>", line 1, in <module>
nlp.initialize()
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1273, in initialize
proc.initialize(get_examples, nlp=self, **p_settings)
File "spacy/pipeline/tagger.pyx", line 271, in spacy.pipeline.tagger.Tagger.initialize
File "spacy/pipeline/pipe.pyx", line 104, in spacy.pipeline.pipe.Pipe._require_labels
ValueError: [E143] Labels for component 'tagger' not initialized. This can be fixed by calling add_label, or by providing a representative batch of examples to the component's `initialize` method.
Now I am a bit at a loss. Which label examples? Where do I take them from? Why doesn't the default model config take care of that? Do I have to tell spaCy to use en_core_web_sm somehow? If so, how can I do that without using spacy.load("en_core_web_sm") and excluding a whole bunch of stuff? Thanks for your hints!
EDIT: Ideally, I would like to be able to load only parts of the pipeline from a modified config file, like nlp = English.from_config(config). I cannot even use the config file shipped with en_core_web_sm, as the resulting pipeline needs to be initialized as well, and upon nlp.initialize() I now receive:
Traceback (most recent call last):
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3437, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-67-eeec225a68df>", line 1, in <module>
nlp.initialize()
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 1246, in initialize
I = registry.resolve(config["initialize"], schema=ConfigSchemaInit)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/config.py", line 727, in resolve
resolved, _ = cls._make(
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/config.py", line 776, in _make
filled, _, resolved = cls._fill(
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/thinc/config.py", line 848, in _fill
getter_result = getter(*args, **kwargs)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/language.py", line 98, in load_lookups_data
lookups = load_lookups(lang=lang, tables=tables)
File "/home/valentin/miniconda3/envs/eval/lib/python3.8/site-packages/spacy/lookups.py", line 30, in load_lookups
raise ValueError(Errors.E955.format(table=", ".join(tables), lang=lang))
ValueError: [E955] Can't find table(s) lexeme_norm for language 'en' in spacy-lookups-data. Make sure you have the package installed or provide your own lookup tables if no default lookups are available for your language.
hinting that it cannot find the required lookup tables (per the error message, the spacy-lookups-data package is missing or custom tables would have to be provided).
nlp.add_pipe("tagger")
adds a new blank/uninitialized tagger, not the tagger from en_core_web_sm
or any other pretrained pipeline. If you add the tagger this way, you need to initialize and train it before you can use it.
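If a blank tagger is really what you want, the E143 message above already points at the fix: give the component its labels (or representative examples) before initializing. A minimal sketch, with a hypothetical label set standing in for whatever your training data would provide; the resulting tagger runs, but its predictions are untrained noise:
import spacy

nlp = spacy.blank("en")
tagger = nlp.add_pipe("tagger")
# Hypothetical labels; in practice these come from your training corpus.
for label in ("NN", "VB", "DT"):
    tagger.add_label(label)
nlp.initialize()  # allocates the model weights now that the labels are known
doc = nlp("This is some sample text.")  # runs, but the tags are meaningless until trained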
You can add a component from an existing pipeline using the source option:
nlp = spacy.blank("en")
nlp.add_pipe("tagger", source=spacy.load("en_core_web_sm"))
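Here is a sketch of the full round trip. One assumption to flag: in en_core_web_sm the tagger listens to a shared tok2vec component, so that component is sourced alongside it; if your source pipeline's tagger is self-contained, the tok2vec line can be dropped.
import spacy

source_nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")
# Assumption: the sourced tagger listens to the shared tok2vec component
# (as in the stock en_core_web_sm config), so source that too.
nlp.add_pipe("tok2vec", source=source_nlp)
nlp.add_pipe("tagger", source=source_nlp)

doc = nlp("This is some sample text.")
print([(token.text, token.tag_) for token in doc])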
That said, it's possible that the tokenization from spacy.blank("en") is different from what the tagger in the source pipeline was trained on. In general (and especially once you move away from spaCy's pretrained pipelines), you should also make sure the tokenizer settings are the same, and loading a pipeline while excluding components is an easy way to do this.
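For completeness, a sketch of that load-and-exclude route; the component names listed are the ones shipped in en_core_web_sm v3 and may differ for other pipelines:
import spacy

# Keep only the tokenizer, tok2vec and tagger; drop everything else.
nlp = spacy.load(
    "en_core_web_sm",
    exclude=["parser", "attribute_ruler", "lemmatizer", "ner"],
)
print(nlp.pipe_names)  # expected: ['tok2vec', 'tagger']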
Alternatively, you can copy the tokenizer settings in addition to using nlp.add_pipe(source=). This is the way to go for models like scispacy's en_core_sci_sm, which is a good example of a pipeline whose tokenization is not the same as that of spacy.blank("en"):
nlp = spacy.blank("en")
source_nlp = spacy.load("en_core_sci_sm")
nlp.tokenizer.from_bytes(source_nlp.tokenizer.to_bytes())
nlp.add_pipe("tagger", source=source_nlp)