nltk, named-entity-extraction

What are the entity types for NLTK?


I've been trying to find the full list of entity types of NLTK. I was only able to find the most common ones on this page, but not the full list. Could you please share the full list of named entity types NLTK has?


Solution

  • That's a very good question; I've wondered the same myself. It doesn't seem to be documented anywhere, even in the nltk source. Of course, it is determined by the corpus that the chunker was trained on, which seems to be (or to have been) the ACE corpus, which is not distributed with the nltk.

    A little bit of digging around in the source turned up the answer:

    >>> import nltk
    >>> chunker = nltk.data.load(nltk.chunk._MULTICLASS_NE_CHUNKER)  # cf. nltk/chunk/__init__.py
    >>> sorted(chunker._tagger._classifier.labels())
    ['B-FACILITY', 'B-GPE', 'B-GSP', 'B-LOCATION', 'B-ORGANIZATION', 'B-PERSON', 
     'I-FACILITY', 'I-GPE', 'I-GSP', 'I-LOCATION', 'I-ORGANIZATION', 'I-PERSON',
     'O']
    

    Note that some of the "common" types mentioned in the book, including DATE and TIME, are not actually detected by this chunker. GPE stands for Geo-Political Entity; GSP stands for Geographical-Social-Political Entity, an older tag that was replaced by GPE in the ACE project. From their definitions (see links below), the two seem to be pretty much equivalent.
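
    The labels above follow the usual IOB convention: a B- prefix marks the first token of an entity, I- marks a continuation token, and O marks tokens outside any entity. So the actual entity types are the distinct names that remain once the prefixes are stripped. Here is a minimal sketch of that step. It works on the label list copied from the session above, so it does not need the chunker model; the names `entity_types` and `aliases` are made up for illustration and are not part of nltk. Folding GSP into GPE is optional and only reflects the assumption that the two tags are close to equivalent.

```python
# Labels copied from the chunker session above.
LABELS = [
    'B-FACILITY', 'B-GPE', 'B-GSP', 'B-LOCATION', 'B-ORGANIZATION', 'B-PERSON',
    'I-FACILITY', 'I-GPE', 'I-GSP', 'I-LOCATION', 'I-ORGANIZATION', 'I-PERSON',
    'O',
]


def entity_types(labels, aliases=None):
    """Return the sorted, distinct entity types behind a list of IOB labels.

    `aliases` is a hypothetical helper argument: a dict mapping one type
    name to another, e.g. {"GSP": "GPE"} to merge the older tag into the newer.
    """
    aliases = aliases or {}
    types = set()
    for label in labels:
        if label == 'O':  # "O" marks tokens outside any entity
            continue
        # Split 'B-PERSON' into prefix 'B' and type name 'PERSON'.
        _, _, name = label.partition('-')
        types.add(aliases.get(name, name))
    return sorted(types)


print(entity_types(LABELS))
# ['FACILITY', 'GPE', 'GSP', 'LOCATION', 'ORGANIZATION', 'PERSON']

print(entity_types(LABELS, aliases={'GSP': 'GPE'}))
# ['FACILITY', 'GPE', 'LOCATION', 'ORGANIZATION', 'PERSON']
```

    That gives six entity types for this chunker, or five if you treat GSP as a synonym for GPE.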

    Edit, January 2019: Prompted by Daniel's question, I looked at the documentation of the ACE project myself in search of a description of these entities. Sure enough, this page links to documentation for each phase of the project. The entity names listed above, including the mysterious GSP but without the GPE entity, were used through Phase 1 of the project. Starting with Phase 2, GPE replaced GSP on the list. One has to wonder how the nltk chunker ended up being trained on both GPE and GSP, or how it decides between the two. My best guess is that it was trained on a combination of Phase 1 and Phase 2 materials.