python nlp named-entity-recognition freeling

Python API usage for coreference, semantic graph and NERC


Intro

Hi, I have been using FreeLing for a few months now to extract triplets. So far I have succeeded by using the dependency tree and the full parse tree, but now I am trying to add NERC.

My work so far

I checked the Python tutorial, but I couldn't find anything beyond dependency parsing. So I went through the class list (since the same classes should be available for Python and C++), but it is not very clear how to retrieve the named entities. After checking the output of the analyzer sampler, I also have a few questions about the performance of the NER module.

Problems

What I'd like help with is the following:

  1. Doubt about entities: Using the example "Sobre la mesa María ve y coge una manzana, un sombrero, una llave y dos paraguas rojo.", I realized that capitalized and lowercase input produce different results: when I make everything lowercase, the entity recognizer stops recognizing "maría" as a person. Is there a workaround for this, or am I going in the wrong direction? The main problem is that when "maría" is not recognized as a named entity (which I need it to be, by the way), it is no longer the subject of the sentence. I'm using:

neclass = pyfreeling.ner(lpath + "/nerc/ner/ner-ab-rich.dat")

  2. How to retrieve named entities: As a follow-up to the previous question, how do I get the named entities? I couldn't find any code related to this, and the semantic graph I obtain holds 0 entities.

Any comments and suggestions are welcome. Thanks in advance.


Solution

  • Apparently there are three NERC modules, one rule-based and two ML-based. All of them use capitalization as a feature, and since both ML models are trained on standard text, all NEs seen in training are capitalized. Lowercase named entities are therefore unlikely to be recognized.

    As for retrieval, it seems that get_label() on the nodes can provide this information: if a word (or multiword) has a PoS tag starting with "NP", it was recognized by the NERC module.

    This is based on the FreeLing authors' own explanation, which you can find here
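
To make the "NP" check concrete, here is a minimal sketch of how the filtering could look. It deliberately works on plain (form, tag) pairs rather than on pyfreeling objects, so it runs without FreeLing installed; with the real API you would presumably build those pairs from each word, e.g. with `w.get_form()` and `w.get_tag()`. The function name `extract_named_entities`, the `NE_CLASSES` mapping, and the hand-written `sample` are my own illustrations, not part of FreeLing. The mapping assumes EAGLES-style tags such as "NP00SP0", and the underscore handling assumes multiwords are joined with "_". Note that lowercasing is applied only after recognition, so the analyzer would still see the original capitalization.

```python
from typing import Iterable, List, Tuple

# Assumption: EAGLES-style proper-noun tags, where characters 5-6 carry the class
# (e.g. "NP00SP0" for a person). Adjust to the tagset your model actually emits.
NE_CLASSES = {"SP": "person", "G0": "location", "O0": "organization", "V0": "other"}


def extract_named_entities(
    tagged_words: Iterable[Tuple[str, str]], lowercase: bool = False
) -> List[Tuple[str, str]]:
    """Return (text, class) for every word whose PoS tag starts with "NP"."""
    entities = []
    for form, tag in tagged_words:
        if not tag.startswith("NP"):
            continue  # not marked as a named entity
        ne_class = NE_CLASSES.get(tag[4:6], "unknown")
        text = form.replace("_", " ")  # assumed multiword separator
        if lowercase:
            text = text.lower()  # normalize only after recognition
        entities.append((text, ne_class))
    return entities


# Hand-written stand-in for analyzer output; the tags are illustrative only.
sample = [
    ("Sobre", "SP"),
    ("la", "DA0FS0"),
    ("mesa", "NCFS000"),
    ("María", "NP00SP0"),
    ("ve", "VMIP3S0"),
]

if __name__ == "__main__":
    print(extract_named_entities(sample))
```

If you need lowercase text downstream, passing `lowercase=True` here is one way to get it without lowercasing the input that goes into the analyzer.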