I am building a spaCy pipeline and would like to split sentences into individual `Doc` objects. According to the spaCy documentation, custom pipeline components take a single `Doc` object as input and return a single `Doc` object. However, I would like to input paragraphs/sections of text and have the pipeline pass each `Doc`'s sentences to the next component as its own `Doc`. This does not seem to be possible, and I was hoping to find a workaround.
We would like to do this because we have trained a `textcat` component that classifies sentences relevant to our use case. However, without splitting the sentences before the `textcat` component, we only get predictions on the entire `Doc` object.
In essence, I would like to do the following, but wrapped up into a spaCy `@Language.component`:
```python
relevant_sents = []
for sent in doc.sents:
    # Convert the sentence Span to a standalone Doc before classifying.
    cats = textcat_model(sent.as_doc()).cats
    if cats.get("RELEVANT", 0.0) > 0.5:
        relevant_sents.append(sent)
```
However, to achieve this, I need some sort of workaround to feed individual sentences into the `textcat` component of the pipeline as `Doc` objects.
Any help would be greatly appreciated!
Components have to return the same number of `Doc`s that they take in; that can't be changed.
What you should do in this situation is have two pipelines / `nlp` objects. Use the sentences from the first one to create `Doc`s that you pass to the second one, as in the sketch below.

In the past, pipelines only took text as input, but recently it's become possible to pass `Doc`s as well. When passing a `Doc`, tokenization is skipped but the other components still run.
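Here is a minimal sketch of that approach. The model path `path/to/textcat_model` and the `RELEVANT` label are placeholders for your own trained pipeline; `Span.as_doc()` turns each sentence into a standalone `Doc`, which can then be fed to the second pipeline:

```python
import spacy

# First pipeline: only used to find sentence boundaries.
nlp_sents = spacy.load("en_core_web_sm")

# Second pipeline: the trained textcat model (placeholder path).
nlp_textcat = spacy.load("path/to/textcat_model")

text = "Some paragraph with several sentences. Only some are relevant."
doc = nlp_sents(text)

# Turn each sentence Span into its own Doc.
sent_docs = [sent.as_doc() for sent in doc.sents]

# Passing Docs (rather than strings) skips tokenization but still
# runs the remaining components, including the textcat.
relevant_sents = []
for sent_doc in nlp_textcat.pipe(sent_docs):
    if sent_doc.cats.get("RELEVANT", 0.0) > 0.5:
        relevant_sents.append(sent_doc)
```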