spacy · spacy-transformers

Force 'parser' to not segment sentences?


Is there an easy way to tell the "parser" pipe not to change the value of Token.is_sent_start?

So, here is the story: I am working with documents that are pre-sentencized (1 line = 1 sentence), and this segmentation is all I need. I realized the parser's segmentation does not always match my documents', so I don't want to rely on it.

I can't change the segmentation after the parser has run (spaCy raises an error if you try), so I cannot correct its mistakes after the fact. And if I segment the text myself first and then apply the parser, it overrules the segmentation I've just made, so that doesn't work either.

So, to force keeping the original segmentation while still using a pretrained transformer model (fr_dep_news_trf), I either:

  1. disable the parser,
  2. add a custom pipe to nlp that sets Token.is_sent_start the way I want,
  3. create the Doc with nlp("an example")

or, I simply create a Doc with

doc = Doc(nlp.vocab, words=["an", "example"], sent_starts=[True, False])

and then I apply every pipeline component except the parser.
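For reference, a minimal sketch of this second approach; spacy.blank("fr") stands in for the real fr_dep_news_trf pipeline (which has to be downloaded first), and note that Doc takes a Vocab as its first argument:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("fr")  # stand-in for spacy.load("fr_dep_news_trf")

# sent_starts takes one value per token: True marks a sentence start.
doc = Doc(
    nlp.vocab,
    words=["Une", "phrase", ".", "Une", "autre", "."],
    sent_starts=[True, False, False, True, False, False],
)

# The segmentation is usable immediately, no sentencizer or parser needed.
print([sent.text for sent in doc.sents])
```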

However, I do still need the parser at some point (because I need to know some subtrees), and if I simply apply it to my Doc, it overrules the segmentation already in place, so in some cases the segmentation ends up incorrect. So I use the following workaround:

  1. Keep the correct segmentation in a list sentences = list(doc.sents)
  2. Apply the parser on the doc
  3. Work with whatever syntactic information the parser computed
  4. Retrieve whatever sentence-level information I need from the list I made earlier, since I can no longer trust Token.is_sent_start.
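The steps above can be sketched as follows; the parser call is left commented out because it requires the downloaded fr_dep_news_trf model:

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("fr")  # in practice: spacy.load("fr_dep_news_trf")

doc = Doc(
    nlp.vocab,
    words=["Une", "phrase", ".", "Une", "autre", "."],
    sent_starts=[True, False, False, True, False, False],
)

# 1. Snapshot the trusted segmentation as token offsets.
sentence_offsets = [(sent.start, sent.end) for sent in doc.sents]

# 2. Apply the parser (with the trained pipeline loaded):
# doc = nlp.get_pipe("parser")(doc)

# 3. Work with the parser's output (token.head, token.children, subtrees, ...).

# 4. Rebuild the trusted sentences from the saved offsets,
#    since Token.is_sent_start can no longer be relied on.
trusted_sents = [doc[start:end] for start, end in sentence_offsets]
```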

It works, but it doesn't really feel right imho; it feels a bit messy. Is there an easier, cleaner way I missed?

Something else I am considering is setting a custom extension: I would, for instance, use Token._.is_sent_start instead of the default Token.is_sent_start, plus a custom Doc._.sents. But I fear it might be more confusing than helpful...
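That idea would look roughly like this (the underscore extensions here are hypothetical custom attributes, not spaCy built-ins; the parser never writes to underscore attributes, so they survive parsing):

```python
import spacy
from spacy.tokens import Doc, Token

# Hypothetical extensions mirroring the built-in attributes.
Token.set_extension("is_sent_start", default=False)

def custom_sents(doc):
    # Yield spans between consecutive tokens flagged as sentence starts.
    start = 0
    for i, token in enumerate(doc):
        if i > 0 and token._.is_sent_start:
            yield doc[start:i]
            start = i
    yield doc[start:len(doc)]

Doc.set_extension("sents", getter=custom_sents)

# Usage: flags set by hand here (in practice, by a custom component).
nlp = spacy.blank("fr")
doc = nlp("Une phrase . Une autre .")
doc[0]._.is_sent_start = True
doc[3]._.is_sent_start = True
print([s.text for s in doc._.sents])
```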

Some user suggested using span.merge() for a pretty similar topic (Preventing spaCy splitting paragraph numbers into sentences), but that method no longer exists in recent releases of spaCy; it was removed in v3 in favor of the Doc.retokenize context manager.


Solution

  • The parser is supposed to respect sentence boundaries if they are set in advance. There is one outstanding bug where this doesn't happen, but it only occurs when some tokens have their sentence boundaries left unset.

    If you set every token's boundary to True or False (not None) and then run the parser, does it overwrite your values? If so, it would be great to have a specific example of that, because that sounds like a bug.

    Given that, if you use a custom component that sets your true sentence boundaries before the parser runs, it should work.
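    Concretely, that might look like the sketch below. The component name "gold_sent_starts" and the newline heuristic are made up for illustration; with the trained pipeline you would add the component with before="parser" so it runs first.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

@Language.component("gold_sent_starts")  # hypothetical component name
def gold_sent_starts(doc):
    for token in doc:
        # Set every token explicitly to True or False; a token left at
        # None is a gap the parser is allowed to fill in itself.
        token.is_sent_start = token.i == 0 or doc[token.i - 1].text == "\n"
    return doc

nlp = spacy.blank("fr")  # in practice: spacy.load("fr_dep_news_trf")
nlp.add_pipe("gold_sent_starts")  # trained pipeline: before="parser"

# Passing a pre-built Doc skips the tokenizer but runs the components.
doc = nlp(Doc(nlp.vocab, words=["Une", "phrase", "\n", "Deux", "mots"]))
print([t.is_sent_start for t in doc])
```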

    Regarding some of your other points...

    I don't think it makes any sense to keep your sentence boundaries separate from the parser's: if you do, you can end up with subtrees that span multiple sentences, which will just be weird and unhelpful.

    You didn't mention this in your question, but is treating each sentence/line as a separate Doc an option? (It's not clear whether you're combining multiple lines and getting wrong sentence boundaries, or passing in a single line that gets split into multiple sentences.)