python spacy sentence

Manually set sentence boundaries in spaCy


Suppose I know ahead of time the character-level sentence boundaries in a document:

text = "The cat chased the mouse. The mouse ran away."
boundaries = [(0, 25), (26, 45)]
for start, end in boundaries:
    print(text[start:end])

Is there a way to tell spaCy to use these boundaries? From what I can gather from the official docs and elsewhere on SO, the hooks provided seem better suited to custom stateless rules that apply at the word (token) level.


Solution

  • You can't put sentence boundaries at arbitrary character positions: spaCy won't let a sentence boundary fall in the middle of a token.

    What you can do is iterate over the tokens and use token.idx (the character offset of the token in the text) to apply your boundaries, by finding the token whose offset lines up with each boundary index. You'll have to decide on a policy for when token boundaries don't line up with your values, whether that's throwing an exception or handling the mismatch some other way.
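
The approach described above could be sketched roughly as follows. This is a minimal illustration, not an official spaCy recipe: the helper names `sentence_start_flags` and `apply_boundaries` and the `strict` parameter are made up for this example. It assumes each boundary is a `(start, end)` character pair as in the question, and it relies only on `token.idx` and the writable `token.is_sent_start` attribute.

```python
def sentence_start_flags(token_offsets, boundaries, strict=True):
    """Return one bool per token: True if a sentence starts at that token.

    token_offsets: character offset of each token (token.idx in spaCy).
    boundaries: (start, end) character pairs; only the starts are used.
    strict: hypothetical policy switch. If True, raise when a boundary
        start does not coincide with the start of any token; if False,
        silently ignore such boundaries.
    """
    starts = {start for start, _end in boundaries}
    unmatched = starts - set(token_offsets)
    if unmatched and strict:
        raise ValueError(
            f"boundary starts not aligned with any token: {sorted(unmatched)}"
        )
    return [offset in starts for offset in token_offsets]


def apply_boundaries(doc, boundaries, strict=True):
    """Mark sentence starts on a spaCy Doc from character-level boundaries.

    Assumes the Doc has not been dependency-parsed yet, since the flags
    are meant to be set before a parser assigns its own boundaries.
    """
    flags = sentence_start_flags([token.idx for token in doc], boundaries, strict)
    for token, flag in zip(doc, flags):
        # True marks a sentence start; False explicitly rules one out.
        token.is_sent_start = flag
    return doc
```

With a tokenizer-only pipeline, for example `nlp = spacy.blank("en")`, calling `apply_boundaries(nlp.make_doc(text), boundaries)` and then iterating over `doc.sents` should give back the two sentences from the question. If the pipeline includes a dependency parser, the flags would need to be set before that component runs, for instance by calling `nlp.make_doc` first and then passing the Doc through the remaining components.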