I am adding a custom component to spaCy but it never gets called:
import spacy
from spacy.language import Language

@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    print(".")
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("de_core_web_sm")
nlp.add_pipe("custom_sentence_boundaries", after="parser")
nlp.analyze_pipes(pretty=True)

doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
I do get a result in sentences, and the analyzer lists my component, but it seems to have no effect and I never see the dots from the print statement.
Any ideas?
In the code you have pasted, you are doing:

nlp = spacy.load("de_core_web_sm")

However (for English text like in your example), it should be:

nlp = spacy.load("en_core_web_sm")
I tried to reproduce your code and got the expected result:
import spacy
from spacy.language import Language

@Language.component("custom_sentence_boundaries")
def custom_sentence_boundaries(doc):
    print("...$...")  # printing "...$..." so that it is easily visible
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("custom_sentence_boundaries", after="parser")
nlp.analyze_pipes(pretty=True)

text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")

doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
# Output: note at the bottom that ...$... is printed, and that
# custom_sentence_boundaries is listed right after parser, as specified
# by the after="parser" keyword argument.
============================= Pipeline Overview =============================

#   Component                    Assigns               Requires   Scores             Retokenizes
-   --------------------------   -------------------   --------   ----------------   -----------
0   tok2vec                      doc.tensor                                          False
1   tagger                       token.tag                        tag_acc            False
2   parser                       token.dep                        dep_uas            False
                                 token.head                       dep_las
                                 token.is_sent_start              dep_las_per_type
                                 doc.sents                        sents_p
                                                                  sents_r
                                                                  sents_f
3   custom_sentence_boundaries                                                       False
4   attribute_ruler                                                                  False
5   lemmatizer                   token.lemma                      lemma_acc          False
6   ner                          doc.ents                         ents_f             False
                                 token.ent_iob                    ents_p
                                 token.ent_type                   ents_r
                                                                  ents_per_type

✔ No problems found.
...$...
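One more thing worth checking: the component only changes segmentation when the input actually contains "\n" tokens — your example text has none, so even when the component runs it has nothing to do. Here is a minimal sketch (the component name newline_boundaries is made up, and a blank pipeline stands in for the trained model so no download is needed; with a trained model you would add it after="parser" as above):

```python
import spacy
from spacy.language import Language

@Language.component("newline_boundaries")
def newline_boundaries(doc):
    # Start a new sentence right after every newline token.
    for token in doc[:-1]:
        if token.text == "\n":
            doc[token.i + 1].is_sent_start = True
    return doc

# A blank English pipeline is enough to see the effect of the component.
nlp = spacy.blank("en")
nlp.add_pipe("newline_boundaries")

doc = nlp("This is a line\nand this is another")
print([sent.text for sent in doc.sents])
```

With a newline in the input you get two sentences; without one, the component returns the doc unchanged and doc.sents comes entirely from the parser.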