nlpentitynamed-entity-recognitionnamedpart-of-speech

Rule based named entity recognizer without parts of speech label or any other information


I'm working on a project where I am trying to build a named entity recognizer from texts. So basically I want to build and experiment the NER in 3 different ways.

First, I want to build it using only segmented sentences-> tokenized words. To clarify, I want to input only split/tokenized words into the system. Once again, the NER system is rule-based. Hence, it can only use rules to conclude which is a named entity. In the first NER, it will not have any chunk information or part of speech label. Just the tokenized words. Here, the efficiency is not the concern. Rather the concern lies in comparing the 3 different NERs, how they perform. (The one I am asking about is the 1st one).

I thought of it for a while and could not figure out any rules or any idea of coming up with a solution to this problem. One naive approach would be to conclude all words beginning with an uppercase and that does not follow a period to be a named entity.

Am I missing anything? Any heads up or guidelines would help.


Solution

  • Typically NER relies on preprocessing such as part-of-speech tagging (named entities are typically nouns), so not having this basic information makes the task more difficult and therefore more prone to error. There will be certain patterns that you could look for, such as the one you suggest (although what do you do with sentence-initial named entities?). You could add certain regular expression patterns with prepositions, e.g. (Title_case_token)+ of (the)? (Title_case_token)+ would match "Leader of the Free World", "Prime Minister of the United Kindom", and "Alexander the Great". You might also want to consider patterns to match acronyms such as "S.N.C.F.", "IBM", "UN", etc. A first step is probably to look for some lexical resources (i.e. word lists) like country names, first names, etc., and build from there.

    You could use spaCy (Python) or TokensRegex (Java) to do token-based matching (and not use the linguistic features they add to the tokens).