I use the benepar parser to parse sentences into constituency trees. How can I prevent the benepar parser from splitting a specific substring when parsing a string? E.g., the token gonna is split by benepar into the two tokens gon and na, which I don't want.
Code example, with prerequisites:
pip install spacy benepar
python -m nltk.downloader punkt benepar_en3
python -m spacy download en_core_web_md
If I run:
import benepar, spacy

benepar.download('benepar_en3')  # downloads the benepar model (via nltk)
nlp = spacy.load('en_core_web_md')
# benepar's pipeline API differs between spaCy v2 and v3
if spacy.__version__.startswith('2'):
    nlp.add_pipe(benepar.BeneparComponent("benepar_en3"))
else:
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp("This is gonna be fun.")
sent = list(doc.sents)[0]
print(sent._.parse_string)
It'll output:
(S (NP (DT This)) (VP (VBZ is) (VP (TO gon) (VP (TO na) (VP (VB be) (NP (NN fun)))))) (. .))
The issue is that the token gonna is split into the two tokens gon and na. How can I prevent that?
Use nlp.tokenizer.add_special_case:
import benepar, spacy
from spacy.symbols import ORTH

benepar.download('benepar_en3')
nlp = spacy.load('en_core_web_md')
# Register a tokenizer exception so "gonna" is kept as a single token
# instead of being split by spaCy's default contraction rules.
nlp.tokenizer.add_special_case('gonna', [{ORTH: 'gonna'}])
if spacy.__version__.startswith('2'):
    nlp.add_pipe(benepar.BeneparComponent("benepar_en3"))
else:
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})
doc = nlp("This is gonna be fun.")
sent = list(doc.sents)[0]
print(sent._.parse_string)
This is the output for the above code:
(S (NP (DT This)) (VP (VBZ is) (VP (TO gonna) (VP (VB be) (NP (NN fun))))) (. .))
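For intuition, the mechanism behind add_special_case can be sketched in plain Python. This is a simplified model, not spaCy's actual implementation: the DEFAULT_EXCEPTIONS table and the tokenize helper below are hypothetical stand-ins for spaCy's built-in tokenizer exceptions and tokenization loop.

```python
import re

# Hypothetical contraction rules standing in for spaCy's default
# tokenizer exceptions: each surface form maps to its pieces.
DEFAULT_EXCEPTIONS = {
    "gonna": ["gon", "na"],
    "wanna": ["wan", "na"],
}

def tokenize(text, special_cases=None):
    """Split on words/punctuation, then apply exception rules.

    Entries in special_cases override DEFAULT_EXCEPTIONS, just as
    nlp.tokenizer.add_special_case overrides spaCy's built-in rules.
    """
    rules = dict(DEFAULT_EXCEPTIONS)
    rules.update(special_cases or {})
    tokens = []
    for word in re.findall(r"\w+|[^\w\s]", text):
        tokens.extend(rules.get(word, [word]))
    return tokens

# Default rules split "gonna":
print(tokenize("This is gonna be fun."))
# ['This', 'is', 'gon', 'na', 'be', 'fun', '.']

# A special case mapping "gonna" to itself keeps it whole:
print(tokenize("This is gonna be fun.", {"gonna": ["gonna"]}))
# ['This', 'is', 'gonna', 'be', 'fun', '.']
```

The key point is that the special case wins the lookup before any splitting rule fires, which is why benepar (which parses the tokens spaCy hands it) then sees gonna as a single token.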