I'm training an allennlp crf_tagger. I'm using a predictor which is based on the SentenceTaggerPredictor. The issue is the tokenizer argument - in the case of the SentenceTaggerPredictor there's a language argument.
Since SentenceTaggerPredictor has language="en_core_web_sm" as a default argument, when I do
Predictor.from_path("model.tar.gz", "sentence_tagger")
the tokenizer is created using the default language. But what happens if the training data was tokenized using a different language? How do I specify the predictor's arguments in the model's config.json so that the predictor returned by Predictor.from_path is constructed with a non-default language?
The Predictor.from_path() method has an overrides parameter that you could use in this case. For example:
Predictor.from_path("model.tar.gz", "sentence_tagger", overrides={"dataset_reader.tokenizer.language": "en"})
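To make the language easy to swap, the call above can be wrapped in a small helper. This is only a sketch of the call shape from this answer: the helper names language_override and load_tagger are made up for illustration, and it assumes an AllenNLP version whose from_path accepts overrides as a dict. It is worth checking on your version that the predictor's own tokenizer actually picks up the overridden language.

```python
from typing import Any, Dict


def language_override(language: str) -> Dict[str, Any]:
    """Build the overrides dict used in this answer for a given language name."""
    # The dotted key is the one from the answer; it is not validated here.
    return {"dataset_reader.tokenizer.language": language}


def load_tagger(archive_path: str, language: str):
    """Load a sentence tagger predictor with the tokenizer language overridden."""
    # Imported lazily so language_override can be used without AllenNLP installed.
    from allennlp.predictors.predictor import Predictor

    return Predictor.from_path(
        archive_path,
        "sentence_tagger",
        overrides=language_override(language),
    )
```

Usage would then look like load_tagger("model.tar.gz", "de_core_news_sm"), where "de_core_news_sm" is just an example spaCy model name standing in for whatever language the training data was tokenized with.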