nlp text-mining information-extraction named-entity-recognition named-entity-extraction

Methods for extracting locations from text?


What are the recommended methods for extracting locations from free text?

One option I can think of is regex rules like "words ... in location", but are there better approaches?
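To make the regex idea concrete, here is a minimal sketch, not a recommended solution. The pattern and the name `regex_locations` are illustrative assumptions: it treats any run of capitalized words right after a preposition such as "in" or "from" as a location candidate.

```python
import re

# Hypothetical rule: a location-like preposition followed by one or more
# capitalized words. The preposition list is an assumption for illustration.
LOCATION_AFTER_PREPOSITION = re.compile(
    r"\b(?:in|at|from|near)\s+((?:[A-Z][\w'-]*)(?:\s+[A-Z][\w'-]*)*)"
)


def regex_locations(text):
    """Return capitalized phrases that directly follow a preposition."""
    return LOCATION_AFTER_PREPOSITION.findall(text)
```

A rule like this also picks up non-locations: in "See you in May" it returns "May", and it misses any place name that is written in lowercase or does not follow one of the listed prepositions.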

Another idea is a lookup hash table of country and city names, comparing every token extracted from the text against the entries in that table.
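The lookup idea could be sketched roughly as follows. The tiny `GAZETTEER` set and the function name `gazetteer_locations` are placeholders I made up for illustration; a real table would be loaded from a place-name database. Multi-word names such as "new york" are handled by trying the longest token window first.

```python
import re

# Tiny illustrative gazetteer (an assumption); a real one would hold
# many thousands of country and city names.
GAZETTEER = {"paris", "london", "new york", "germany", "san francisco"}
MAX_NAME_TOKENS = max(len(name.split()) for name in GAZETTEER)


def gazetteer_locations(text):
    """Return gazetteer entries found in text, preferring the longest match."""
    tokens = re.findall(r"\w+", text.lower())
    found = []
    i = 0
    while i < len(tokens):
        # Try the widest window first so "new york" wins over a shorter entry.
        for n in range(min(MAX_NAME_TOKENS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in GAZETTEER:
                found.append(candidate)
                i += n
                break
        else:
            # No entry starts at this token; move on.
            i += 1
    return found
```

Each set lookup takes constant time on average, so the cost grows linearly with the number of tokens, which matters for a large number of tweets. The weakness is ambiguity: "Paris Hilton posted a photo" yields "paris" even though no location is meant.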

Does anybody know of better approaches?

Edit: I'm trying to extract locations from the text of tweets, so the high volume of tweets might also affect my choice of method.


Solution

  • All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)

    This problem is called Named Entity Recognition (NER). Location is one of the three most studied entity classes, along with Person and Organization. Stanford NLP has a powerful open-source Java implementation: http://nlp.stanford.edu/software/CRF-NER.shtml

    You can easily find implementations in other programming languages.
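As one example in another language, here is a minimal Python sketch that uses spaCy instead of the Stanford tagger named above. It assumes spaCy is installed and that the small English model `en_core_web_sm` has been downloaded; the helper names are my own, and the label set (`GPE`, `LOC`, `FAC`) is the subset of that model's entity labels that I am treating as "location".

```python
# Entity labels treated as locations (an assumption about what counts):
# GPE = countries, cities, states; LOC = other locations; FAC = facilities.
LOCATION_LABELS = {"GPE", "LOC", "FAC"}


def keep_locations(entities):
    """Filter (text, label) pairs down to the texts with a location label."""
    return [text for text, label in entities if label in LOCATION_LABELS]


def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline; imported lazily so the filter works without spaCy."""
    import spacy
    return spacy.load(model_name)


def ner_locations(text, nlp):
    """Run the NER pipeline on text and return the location entities."""
    doc = nlp(text)
    return keep_locations((ent.text, ent.label_) for ent in doc.ents)
```

Typical use would be to call `nlp = load_pipeline()` once and then `ner_locations(tweet, nlp)` for each tweet, so the model is not reloaded per message. Accuracy on tweets may be lower than on edited text, since models like this are mostly trained on well-formed prose.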