Address Splitting with NLP
I am currently working on a project that should identify each part of an address; for example, from "str. Jack London 121, Corvallis, ARAD, ap. 1603, 973130" the output should look like this:

street name: Jack London
no: 121
city: Corvallis
state: ARAD
apartment: 1603
zip code: 973130

The problem is that the input data is not all in the same format, so some elements may be missing or appear in a different order, but the input is guaranteed to be an address.

I checked some sources on the internet, but a lot of them are adapted to US addresses only, like the Google Places API; I will be using this for another country.

Regex is not an option, since the address format can vary too much.

I also thought about using an NLP Named Entity Recognition model, but I'm not sure that will work.

Do you know what could be a good way to start? Could you maybe help me with some tips?

Costumer answered 24/3, 2021 at 10:18

There is a similar question on the Data Science Stack Exchange with only one answer, which suggests spaCy.
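
As a starting point with spaCy, you could train a custom NER model whose entity labels are the address parts. Below is a minimal sketch using spaCy 3.x; the label names, the single training example, and its character offsets are all illustrative, and a real model would need a much larger annotated set:

import spacy
from spacy.training import Example

nlp = spacy.blank("ro")  # blank pipeline; pick your country's language
ner = nlp.add_pipe("ner")
for label in ("STREET", "NUMBER", "CITY", "STATE", "APARTMENT", "ZIP"):
    ner.add_label(label)

# One hand-annotated example (start/end character offsets per label);
# real training needs many thousands of these.
TRAIN_DATA = [
    ("str. Jack London 121, Corvallis, ARAD, ap. 1603, 973130",
     {"entities": [(5, 16, "STREET"), (17, 20, "NUMBER"),
                   (22, 31, "CITY"), (33, 37, "STATE"),
                   (43, 47, "APARTMENT"), (49, 55, "ZIP")]}),
]

optimizer = nlp.initialize()
for _ in range(30):  # a few epochs just for the sketch
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("bd. Unirii 45, Timisoara, TIMIS, 300100")
print([(ent.text, ent.label_) for ent in doc.ents])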

Another question, on detecting addresses using Stanford NLP, details a different approach to detecting addresses and their constituents.

There is the LexNLP library that has a feature to detect and split addresses this way (snippet adapted from a Towards Data Science article on the library):

from lexnlp.extract.en.addresses import addresses

# d is assumed to be a dict mapping filenames to document text
for filename, text in d.items():
    print(list(addresses.get_addresses(text)))
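
Note that, like the other lexnlp get_* extractors, get_addresses appears to be a generator, hence the list(...) wrapper around the call.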

There is also a relatively new (2018), research-grade codebase, DeepParse (with documentation), for deep learning address parsing, accompanying an IEEE article (paywalled; also indexed on Semantic Scholar).
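
A minimal sketch of the deepparse library's pre-trained parser is below (assuming pip install deepparse); the models were trained on addresses from many countries, so check how well the predicted tags fit your country's format:

from deepparse.parser import AddressParser

# downloads a pre-trained model on first use; "fasttext" is a lighter option
address_parser = AddressParser(model_type="bpemb")
parsed = address_parser("str. Jack London 121, Corvallis, ARAD, 973130")
print(parsed)  # address components tagged by the model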

For the training you will need a large corpus of real addresses, or fake addresses generated using, e.g., the Faker library.
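
Here is a sketch of generating labeled training data with Faker, using the Romanian locale as an example; since you assemble the address from the generated components yourself, the labels come for free. The formatting template is just an illustration, and you would want several templates to cover the format variation you expect:

from faker import Faker

fake = Faker("ro_RO")  # pick the locale of your target country
for _ in range(3):
    parts = {
        "street": fake.street_name(),
        "no": fake.building_number(),
        "city": fake.city(),
        "zip": fake.postcode(),
    }
    # assemble a raw address string in one of the formats you expect
    text = f"str. {parts['street']} {parts['no']}, {parts['city']}, {parts['zip']}"
    print(text, parts)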

Sisley answered 24/3, 2021 at 11:57 Comment(2)
LexNLP seems to be the best out-of-the-box tool I've found. It's a bit of a chore to get running due to Python dependencies and such. The article you linked for it got me started, but it's wrong and/or outdated now. For example, the get_word_features() function is missing an argument and isn't very helpful. lexnlp.extract.en.addresses.get_addresses(), though, is magical. Thanks very much. – Nehemiah
Thank you. Updated the answer to reflect the current API of the library. – Sisley
