Tags: python, nlp, street-address, named-entity-recognition

Address Splitting with NLP


I am currently working on a project that should identify each part of an address. For example, from "str. Jack London 121, Corvallis, ARAD, ap. 1603, 973130" the output should look like this:

street name: Jack London;
no: 121;
city: Corvallis;
state: ARAD;
apartment: 1603;
zip code: 973130

The problem is that the input data are not all in the same format, so some elements may be missing or appear in a different order. Every input is guaranteed to be an address, though.

I checked some sources on the internet, but many of them, such as the Google Places API, are geared toward US addresses only, and I will be using this for another country.

Regex is not an option since the addresses vary too much.

I also thought about using a Named Entity Recognition (NER) model, but I'm not sure that will work.

Do you know what would be a good way to start, and could you give me some tips?


Solution

  • There is a similar question on the Data Science Stack Exchange site with only one answer, which suggests using spaCy.

    Another question, on detecting addresses with Stanford NLP, describes a different approach to detecting addresses and their constituents.

    The LexNLP library has a feature that detects and splits addresses this way (snippet borrowed from a Towards Data Science article on the library; note that the module lives under lexnlp.extract.en, so it targets English-language text):

    from lexnlp.extract.en.addresses import addresses

    # d is assumed to be a dict mapping each filename to its text content
    for filename, text in d.items():
        print(list(addresses.get_addresses(text)))
    

    There is also a relatively new (2018), research-oriented codebase, DeepParse (with documentation), for deep-learning address parsing. It accompanies an IEEE article (paywalled) that is also listed on Semantic Scholar.

    For training, you will need a large corpus of addresses, either real ones or fake ones generated with, e.g., the Faker library.
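As a rough illustration of the synthetic-data idea, here is a minimal sketch that uses only the standard library rather than Faker. It builds addresses from small component pools, randomly drops and reorders parts (as the question says real inputs do), and records character-offset spans for each labeled part, a shape that NER toolkits such as spaCy can consume. The pools, the label names, and the `make_example` helper are all hypothetical placeholders, not part of any library.

```python
import random

# Hypothetical component pools; replace them with real street, city and
# county lists for the target country.
STREETS = ["Jack London", "Mihai Eminescu", "Victoriei"]
CITIES = ["Corvallis", "Arad", "Timisoara"]
STATES = ["ARAD", "TIMIS", "CLUJ"]


def make_example(rng):
    """Return (text, spans); spans are (start, end, label) character offsets."""
    # Each part is a list of (fragment, label) pieces; label None = unlabeled.
    parts = [
        [("str. ", None), (rng.choice(STREETS), "STREET"),
         (" ", None), (str(rng.randint(1, 300)), "NUMBER")],
        [(rng.choice(CITIES), "CITY")],
        [(rng.choice(STATES), "STATE")],
        [("ap. ", None), (str(rng.randint(1, 2000)), "APARTMENT")],
        [(str(rng.randint(100000, 999999)), "ZIP")],
    ]
    # Mimic messy input: always keep the street part, drop each other part
    # with probability 0.3, then shuffle the order.
    kept = [parts[0]] + [p for p in parts[1:] if rng.random() < 0.7]
    rng.shuffle(kept)

    text, spans = "", []
    for i, part in enumerate(kept):
        if i > 0:
            text += ", "
        for fragment, label in part:
            start = len(text)
            text += fragment
            if label is not None:
                spans.append((start, len(text), label))
    return text, spans


rng = random.Random(0)  # fixed seed so runs are repeatable
for _ in range(3):
    text, spans = make_example(rng)
    print(text, spans)
```

Each generated pair can then be converted to whatever annotation format the chosen NER library expects. A real training set would need far larger pools and more formatting variation (abbreviations, missing commas, typos) than this sketch shows.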