I am building a spaCy pipeline and would like to split sentences into individual `Doc` objects. According to the spaCy documentation, custom pipeline components take a single `Doc` object as input and return a single `Doc` object. However, I would like to input paragraphs/sections of text and have the pipeline pass each `Doc`'s sentences to the next component as its own `Doc`. This does not seem to be possible, and I was hoping to find a workaround.
We would like to do this because we have trained a `textcat` component that classifies sentences relevant to our use case. However, without splitting the sentences before the `textcat` component, we only get predictions on the entire `Doc` object.
In essence, I would like to do the following, but wrapped up into a spaCy `@Language.component`:
```python
relevant_sents = []
for sent in doc.sents:
    # Convert the sentence Span to a standalone Doc before classifying.
    cats = textcat_model(sent.as_doc()).cats
    if cats.get("RELEVANT", 0.0) > 0.5:
        relevant_sents.append(sent)
```
However, to achieve this, I need some sort of workaround to feed individual sentences into the `textcat` component of the pipeline as `Doc` objects.
Any help would be greatly appreciated!
Components have to return the same number of `Doc`s that they take in; that can't be changed.
What you should do in this situation is have two pipelines / `nlp` objects. Use the sentences from the first one to create `Doc`s that you pass to the second one, as in the sketch below.

In the past, pipelines only took text as input, but recently it's become possible to pass `Doc`s as well. When passing a `Doc`, tokenization is skipped but the other components still run.
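Here is a minimal sketch of that approach. The model path `path/to/textcat_model` and the `RELEVANT` label are placeholders for your own trained pipeline; `Span.as_doc()` turns each sentence into a standalone `Doc`, which can then be fed to the second pipeline:

```python
import spacy

# First pipeline: only used to find sentence boundaries.
nlp_sents = spacy.load("en_core_web_sm")

# Second pipeline: the trained textcat model (placeholder path).
nlp_textcat = spacy.load("path/to/textcat_model")

text = "Some paragraph with several sentences. Only some are relevant."
doc = nlp_sents(text)

# Turn each sentence Span into its own Doc.
sent_docs = [sent.as_doc() for sent in doc.sents]

# Passing Docs (rather than strings) skips tokenization but still
# runs the remaining components, including the textcat.
relevant_sents = []
for sent_doc in nlp_textcat.pipe(sent_docs):
    if sent_doc.cats.get("RELEVANT", 0.0) > 0.5:
        relevant_sents.append(sent_doc)
```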