I am just starting to work with spaCy and have run a text through it to test how it handles a PDF I OCR'd with AntFileConverter.
The txt file (sample below) seems fine and is in UTF-8. However, when I output the file in CoNLL format, there are various apparent gaps: rows that have no original word but have nonetheless been tagged. This happens both at the end of sentences and within them.
"species in many waters in the northern hemisphere. In most countries in the region pike has both commercial and recreational value (Crossman & Casselman 1987; Raat 1988). Pike is a typical sit-and-wait predator which usually hunts prey by ambushing (Webb & Skadsen 1980)."
The output is as follows:
GPE 24
26 species specie NNS 20 attr
27 in in IN 26 prep
28 many many JJ 29 amod
29 waters water NNS 27 pobj
30 in in IN 29 prep
31 the the DT 33 det
32 northern northern JJ 33 amod
33 hemisphere hemisphere NN 30 pobj
34 . . . 20 punct
1 In in IN 9 prep
2
GPE 1
3 most most JJS 4 amod
4 countries country NNS 9 nsubj
5 in in IN 4 prep
6 the the DT 8 det
7 region region NN 8 compound
8 pike pike NN 5 pobj
9 has have VBZ 0 ROOT
10 both both DT 11 preconj
11 commercial commercial JJ 9 dobj
12
GPE 11
13 and and CC 11 cc
14 recreational recreational JJ 15 amod
15 value value NN 11 conj
16 ( ( -LRB- 15 punct
17 Crossman crossman NNP ORG 15 appos
18 & & CC ORG 17 cc
19 Casselman casselman NNP ORG 17 conj
20 1987 1987 CD DATE 17 nummod
21 ; ; : 15 punct
22
GPE 21
23 Raat raat NNP 15 appos
24 1988 1988 CD DATE 23 nummod
25 ) ) -RRB- 15 punct
26 . . . 9 punct
1 Pike pike NNP 2 nsubj
2 is be VBZ 0 ROOT
3 a a DT 10 det
4 typical typical JJ 10 amod
5 sit sit NN 10 nmod
6 - - HYPH 5 punct
7 and and CC 5 cc
8 - - HYPH 9 punct
9 wait wait VB 5 conj
10 predator predator NN 2 attr
11
GPE 10
12 which which WDT 14 nsubj
13 usually usually RB 14 advmod
14 hunts hunt VBZ 10 relcl
15 prey prey NN 14 dobj
16 by by IN 14 prep
17 ambushing ambush VBG 16 pcomp
18 ( ( -LRB- 17 punct
19 Webb webb NNP 17 conj
20 & & CC 19 cc
21
I also tried without the NER print out but these gaps continue to be marked. I thought it might be related to the line breaks, so I also tried with a Linux-style EOL but that didn't make any difference.
This is the code I am using:
import spacy
import en_core_web_sm

nlp_en = en_core_web_sm.load()
input = open('./input/55_linux.txt', 'r').read()
doc = nlp_en(input)
for sent in doc.sents:
    for i, word in enumerate(sent):
        if word.head == word:
            head_idx = 0
        else:
            head_idx = word.head.i - sent[0].i + 1
        output = open('CONLL_output.txt', 'a')
        output.write("%d\t%s\t%s\t%s\t%s\t%s\t%s\n"%(
            i+1, # There's a word.i attr that's position in *doc*
            word,
            word.lemma_,
            word.tag_, # Fine-grained tag
            word.ent_type_,
            str(head_idx),
            word.dep_ # Relation
        ))
Has anyone else had this problem? If so, do you know how I can solve it?
This is a known bug in spaCy.
Until it is fixed, you will have to do some post-processing to get rid of those "blank" entities. Fortunately, this is easy enough. This snippet, posted by the author of the library, shows how:
def remove_whitespace_entities(doc):
    doc.ents = [e for e in doc.ents if not e.text.isspace()]
    return doc

nlp_en.add_pipe(remove_whitespace_entities, after='ner')
So, you first define a post-processing pipe that filters out all entities whose text consists solely of whitespace characters (using isspace()).
Then you add this pipe to the NLP pipeline, set to run after NER. Any time you use nlp_en after that, it will not return those entities.
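Note that filtering doc.ents only clears the entity label. As far as I can tell, the blank rows themselves are whitespace tokens (the line breaks in the OCR'd text), so the export loop in the question would still write them out. Below is a minimal sketch, assuming you simply want to leave those tokens out of the CoNLL output and renumber the remaining ones. conll_rows is a hypothetical helper, not part of spaCy; it relies only on the Token attributes text, i, head, lemma_, tag_, ent_type_ and dep_.

```python
def conll_rows(sent):
    """Yield one CoNLL-style row per token in `sent`, skipping whitespace-only tokens.

    `sent` is any iterable of spaCy-like tokens (for example a Span from doc.sents).
    Positions are renumbered from 1 over the tokens that are kept.
    """
    # Drop tokens whose text is only whitespace (e.g. "\n" left over from OCR).
    kept = [tok for tok in sent if not tok.text.isspace()]
    # Map each kept token's document index to its new 1-based sentence position.
    position = {tok.i: n for n, tok in enumerate(kept, start=1)}
    for tok in kept:
        if tok.head.i == tok.i:
            head_idx = 0  # the root points at itself in spaCy; CoNLL uses 0
        else:
            # Assumption: if the head was a dropped whitespace token, fall back to 0.
            head_idx = position.get(tok.head.i, 0)
        yield (
            position[tok.i],
            tok.text,
            tok.lemma_,
            tok.tag_,       # fine-grained tag
            tok.ent_type_,
            head_idx,
            tok.dep_,       # dependency relation
        )
```

To use it, loop over doc.sents, call conll_rows(sent) for each sentence, and write "\t".join(str(field) for field in row) followed by a newline for every row. Opening the output file once before the loop, rather than once per token, also avoids leaving many file handles open.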