Tags: spacy, named-entity-recognition, mining

Merging tags into my file using named entity annotation


While learning the basics of text mining I ran into the following problem: I must use named entity annotation to find and locate named entities. Once an entity is found, however, its tag must be inserted into the document. For example, "Hello I am Koen" must become "Hello I am <PERSON>Koen</PERSON>".

I figured out how to find and label the named entities, but I am stuck on getting the tags into the file correctly. I've tried checking whether ent.orth_ occurs in the file and then replacing it with the opening tag + ent.orth_ + the closing tag.

print([(X, X.ent_iob_, X.ent_type_) for X in doc])

I used this for locating where the entities are and where they start.

entities = []
for ent in doc.ents:
    entities.append(ent.orth_ + ", " + ent.label_)

I used this for creating a variable with both the original form and the label.

Right now I have a variable with all the original forms and labels, and I know where the entities start and end. However, I don't know how to do the actual replacement, and I can't find any similar examples.


Solution

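  • Try this. It uses the character offsets that spaCy stores on each entity (ent.start_char and ent.end_char) and replaces the entities from the end of the string backwards: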

    import spacy
    
    nlp = spacy.load("en_core_web_sm")
    s = "Apple is looking at buying U.K. startup for $1 billion"
    doc = nlp(s)
    
    def replace_substring(s, replacement, position, length_of_replaced):
        # Splice `replacement` into `s` in place of the slice starting at `position`.
        return s[:position] + replacement + s[position + length_of_replaced:]
    
    # Walk the entities from last to first so that inserting tags does not
    # shift the character offsets of the entities still to be processed.
    for ent in reversed(doc.ents):
        replacement = "<{}>{}</{}>".format(ent.label_, ent.text, ent.label_)
        position = ent.start_char
        length_of_replaced = ent.end_char - ent.start_char
        s = replace_substring(s, replacement, position, length_of_replaced)
    
    print(s)
    # Output: <ORG>Apple</ORG> is looking at buying <GPE>U.K.</GPE> startup for <MONEY>$1 billion</MONEY>
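
    The same idea can also be written as a single forward pass that builds the tagged string from pieces, which avoids the offset-shifting problem without reversing the entities. The following is only a minimal sketch: tag_entities is a hypothetical helper name, and the span list is written by hand so the snippet runs without a spaCy model. The character offsets stand in for ent.start_char, ent.end_char and ent.label_, and the entities are assumed not to overlap.

```python
def tag_entities(text, spans):
    """Wrap each (start, end, label) character span of text in <LABEL>...</LABEL>.

    Assumes the spans do not overlap (hypothetical helper, not part of spaCy).
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append("<{0}>{1}</{0}>".format(label, text[start:end]))
        cursor = end
    pieces.append(text[cursor:])  # remainder after the last entity
    return "".join(pieces)


# Hand-written offsets standing in for (ent.start_char, ent.end_char, ent.label_).
text = "Hello I am Koen"
spans = [(11, 15, "PERSON")]
tagged = tag_entities(text, spans)
print(tagged)
```

    With a real spaCy doc, the span list could be built as [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents], and the resulting string can then be written back to the file.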