nlp · spacy · named-entity-recognition

Extracting and identifying locations with NLP + spaCy


My goal is to be able to recognize (i.e., detect) and identify (i.e., name, retrieve an ID for) locations from text using NLP. I'm using spaCy specifically.

There are about 1,000 possible locations, but the difficulty is that they are unlikely to be written in a fully qualified way, not to mention spelling mistakes and aliases. For example, the Mission neighborhood in San Francisco written in a fully-qualified way might be (1) Mission (2) City of San Francisco (3) San Francisco County (4) California (5) US. (The numbers are just to illustrate the separate pieces.) However, many people might write it as (1) Mission (2) City of San Francisco, or (1) Mission, or (1) Mission (2) City of San Francisco (4) California. (Not to mention that #1 might be called "Mission District", #2 might be called "San Francisco", #4 might be "CA", etc.)

So my goal is to have an ID for "Mission" and all other neighborhoods, an ID for California and some other states, etc. If the text is like "Mission, San Francisco, CA" then I get the Mission ID. If the text is like "San Francisco, CA" then I get the San Francisco ID.

It's also easy to create synthetic training data by generating aliases of the individual location pieces (e.g., (a) "City of San Francisco", (b) "San Francisco", (c) "San Francisco City") and permutations of the "name chain" (e.g., 1 + 2 + 3 + 4 + 5, 1 + 2, 1 + 2 + 5, etc.) for each alias. A rough estimate is about 50 combinations of alias and name chain per location, or on the order of 50,000 total values.
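
For instance, a quick sketch of how I imagine generating those combinations (the alias lists here are just illustrative):

    from itertools import combinations, product

    # Illustrative aliases for each piece of the "name chain",
    # ordered from lowest (neighborhood) to highest (country) level.
    pieces = [
        ["Mission", "Mission District"],             # (1)
        ["City of San Francisco", "San Francisco"],  # (2)
        ["San Francisco County"],                    # (3)
        ["California", "CA"],                        # (4)
        ["US", "United States"],                     # (5)
    ]

    def name_chains(pieces):
        """Yield a comma-joined string for every alias combination of every
        sub-chain that keeps the lowest-level piece (1, 1+2, 1+2+4, ...)."""
        head, rest = pieces[0], pieces[1:]
        for r in range(len(rest) + 1):
            for chosen in combinations(rest, r):
                for aliases in product(head, *chosen):
                    yield ", ".join(aliases)

    synthetic = list(name_chains(pieces))
    print(len(synthetic))     # rough count of variants for this one location
    print(synthetic[:3])      # e.g. "Mission", "Mission, City of San Francisco", ...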

So, extraction seems to be a good job for NER. The surrounding text usually has a bit of context (e.g., "Location: ..." or "Comes from ...").
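
For the extraction half, I'm picturing something like this (a stock pipeline here just to show the shape; in practice I'd presumably fine-tune on the synthetic data above):

    import spacy

    # Stock English model; a custom model trained on the synthetic
    # "name chain" data would likely do better on neighborhood names.
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Location: Mission, San Francisco, CA. Comes from a reader survey.")
    for ent in doc.ents:
        if ent.label_ in {"GPE", "LOC", "FAC"}:
            print(ent.text, ent.label_)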

However, I'm unsure about the ability to do identification. My understanding is that much of NER identification (e.g., spaCy's EntityLinker, which I planned on using) relies on surrounding context. I expect that there will be very little surrounding context that would help disambiguate one of the O(1000) locations from the others. I also understand that the EntityLinker's match on the token itself is a lookup and not statistical (in other words, the value comes from disambiguating among multiple exact-string matches, not from disambiguating among multiple very fuzzy matches).

The KnowledgeBase / LookupDB does have a mechanism for setting aliases, so I could add each permutation as an alias. But at that point I feel like I'm not getting any value out of the EntityLinker's statistical models.
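
For concreteness, something like this (using spaCy's InMemoryLookupKB from v3.5+; the ID, vector length, and frequency are placeholders):

    import spacy
    from spacy.kb import InMemoryLookupKB

    nlp = spacy.blank("en")
    kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=64)

    # Placeholder ID, frequency, and entity vector.
    kb.add_entity(entity="LOC_MISSION_SF", freq=100, entity_vector=[0.0] * 64)

    # Every alias/permutation maps straight to the entity, so the linker's
    # statistical disambiguation has little left to do.
    for alias in ["Mission", "Mission District", "Mission, San Francisco, CA"]:
        kb.add_alias(alias=alias, entities=["LOC_MISSION_SF"], probabilities=[1.0])

    print(kb.get_alias_candidates("Mission, San Francisco, CA"))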

If I have to create a gazetteer for the identification aspect, then maybe it makes sense to put all my effort into the gazetteer and skip the NER?


Solution

  • Thanks to Vimal for the thoughts. As I suspected, and as Vimal confirmed, I needed to extract the string first and then process it with a separate, non-NLP algorithm.

    I ended up solving this in two ways, and wanted to document my findings.

    1. With some testing I found that an LLM (GPT) was actually pretty effective at determining the "administrative hierarchy" given the extracted string. This prompt, for example:
    Evaluate the following place identifier and determine the most likely place. List the administrative entities from the lowest-level to the highest-level. Explain your reasoning. """Castro, San Francisco, U.S."""
    

    returns this (plus some additional explanation):

    
    The place identifier "Castro, San Francisco, U.S." likely refers to a specific location within the city of San Francisco in the United States. 
    

    I can't find the final version of my prompt at the moment, but I was able to tweak it to get it to provide JSON with the administrative entities in order (national, first level, second level, etc.), plus I asked for any "geographical feature" as a catch-all. (In some cases, I found that my extracted term was something like "Bay Area, United States", and GPT was able to sort that out with the right prompting.) There was a little bit of hallucination in my testing, which worried me.
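
    A rough reconstruction of that call, using the openai Python client (the model name and JSON keys are my assumptions rather than the original prompt):

        import json
        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        PROMPT = (
            "Evaluate the following place identifier and determine the most likely place. "
            "Return JSON listing the administrative entities from lowest to highest level "
            '(keys: "second_level", "first_level", "national") plus a "geographical_feature" '
            "catch-all; use null for anything you cannot determine.\n"
            '"""{identifier}"""'
        )

        def administrative_hierarchy(identifier: str) -> dict:
            response = client.chat.completions.create(
                model="gpt-4o-mini",                      # any recent chat model
                response_format={"type": "json_object"},  # ask for strict JSON
                messages=[{"role": "user",
                           "content": PROMPT.format(identifier=identifier)}],
            )
            return json.loads(response.choices[0].message.content)

        print(administrative_hierarchy("Castro, San Francisco, U.S."))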

    2. With all that being said, I ended up with a much lower-tech approach, along the lines of my original gazetteer idea. My original plan was to use the gazetteer and then sort out the remaining strings with GPT. In the end I matched roughly 98% of strings with the gazetteer alone, and the unmatched strings were objectively wrong and not worth sending to GPT (e.g., a city paired with an incorrect country).

    To create the gazetteer:

    This approach used a dictionary that I compiled from the Wikidata search API. I did a first pass and sent all the strings through Wikidata search to get the top 10 matching entities for each search string, e.g.:

    https://www.wikidata.org/w/api.php?action=query&list=search&srsearch=castro%20san%20francisco%20california&srwhat=text&srlimit=10&srprop=titlesnippet|categorysnippet&srsort=incoming_links_desc
    
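    Scripted, the same search looks roughly like this (the format=json parameter and helper name are mine):

        import requests

        WIKIDATA_API = "https://www.wikidata.org/w/api.php"

        def search_wikidata(text, limit=10):
            """Top `limit` Wikidata full-text search hits for a raw location string."""
            params = {
                "action": "query",
                "list": "search",
                "srsearch": text,
                "srwhat": "text",
                "srlimit": limit,
                "srprop": "titlesnippet|categorysnippet",
                "srsort": "incoming_links_desc",
                "format": "json",
            }
            resp = requests.get(WIKIDATA_API, params=params, timeout=30)
            resp.raise_for_status()
            return resp.json()["query"]["search"]  # each hit's "title" is a Q-id

        for hit in search_wikidata("castro san francisco california"):
            print(hit["title"])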

    I did some pre-filtering to exclude any entity that wasn't an instance of or subclass of a geographical feature or administrative region (using lists of those classes that I searched for and downloaded manually).

    Then I ran all those entities through a method that stored the data (including name, aliases, lat/lng, etc.) and walked up through the "administrative entity" and "country" paths to collect the family tree.
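
    The hierarchy walk boils down to following Wikidata's P131 ("located in the administrative territorial entity") and P17 ("country") claims; a trimmed-down sketch (error handling, lat/lng via P625, and caching omitted):

        import requests

        WIKIDATA_API = "https://www.wikidata.org/w/api.php"

        def get_entity(qid):
            params = {"action": "wbgetentities", "ids": qid, "languages": "en",
                      "props": "labels|aliases|claims", "format": "json"}
            resp = requests.get(WIKIDATA_API, params=params, timeout=30)
            resp.raise_for_status()
            return resp.json()["entities"][qid]

        def first_value(entity, prop):
            """Q-id of the first claim for a property, if any."""
            for claim in entity.get("claims", {}).get(prop, []):
                snak = claim["mainsnak"]
                if snak.get("snaktype") == "value":
                    return snak["datavalue"]["value"]["id"]
            return None

        def family_tree(qid, max_depth=10):
            """Walk P131 (admin entity), falling back to P17 (country),
            collecting each ancestor's name and aliases."""
            tree, seen = [], set()
            while qid and qid not in seen and len(tree) < max_depth:
                seen.add(qid)
                ent = get_entity(qid)
                tree.append({
                    "qid": qid,
                    "name": ent.get("labels", {}).get("en", {}).get("value"),
                    "aliases": [a["value"] for a in ent.get("aliases", {}).get("en", [])],
                })
                qid = first_value(ent, "P131") or first_value(ent, "P17")
            return tree

        print(family_tree("Q62"))  # Q62 = San Francisco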

    To search for place names:

    1. I compiled a dictionary whose keys were entity names and aliases.
    2. I took the list of places (castro, san francisco, california), started with the lowest-level string (castro), and did a fuzzy search against the dictionary keys to look for a match. Anything that scored above a threshold (e.g., 85%) was kept as a candidate.
    3. Then I created a list of each candidate's parent (and grandparent, etc.) names and aliases, looped through the remaining place names to try to match each one against a name in that parents list, and recorded those scores.
    4. Finally, I added the scores up, along with a few other operations: dividing by the number of places I was looking at, biasing toward places with fewer parents (so Castro Valley, California would score higher than Castro, San Francisco, California), and so on (sketched below).
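
    A rough sketch of that matching loop, using rapidfuzz for the fuzzy comparisons (the gazetteer structure, cutoff, and weights here are illustrative, not the exact values I used):

        from rapidfuzz import fuzz, process

        def score_place(place_parts, gazetteer, cutoff=85):
            """place_parts is lowest-level first, e.g. ["castro", "san francisco", "california"].
            gazetteer maps a name/alias to candidate entries, each carrying its own
            ID and a precomputed list of parent names + aliases."""
            best = None
            lowest, rest = place_parts[0], place_parts[1:]
            for key, score, _ in process.extract(lowest, list(gazetteer.keys()),
                                                 scorer=fuzz.WRatio, score_cutoff=cutoff):
                for candidate in gazetteer[key]:
                    total = score
                    for part in rest:
                        hit = process.extractOne(part, candidate["parent_names"],
                                                 scorer=fuzz.WRatio, score_cutoff=cutoff)
                        total += hit[1] if hit else 0
                    # Normalize by the number of parts and bias toward candidates
                    # with fewer parents (illustrative weights).
                    total = total / len(place_parts) - 0.1 * len(candidate["parent_names"])
                    if best is None or total > best[1]:
                        best = (candidate["qid"], total)
            return best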

    All in all, this was surprisingly effective.