How can I extract an address from raw text using NLTK in Python?
I have this text:

'''Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, NY 12345. Can you contact him now? If you need any help, call me on 12345678'''

How can the address part be extracted from the above text using NLTK? I have tried the Stanford NER Tagger, which gives me only New York as a Location. How can I solve this?

Herbage answered 10/6, 2016 at 10:22 Comment(5)
Most people would give regular expressions a try. Besides that, a short search on SO will give you plenty of inspiration. - Fredericksburg
Thanks! That gave me something to start with. - Herbage
Please accept the answer. - Valediction
patrick, that one's in PHP. - Cavesson
Here's a pretty solid Python/NLTK write-up. I'll type it into an answer here with a summary after I implement it myself. - Cavesson
V
16

Definitely regular expressions :)

Something like

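import re

txt = ...  # the raw text to search
# raw string, so backslashes added later are not treated as string escapes
regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)

# address = ['44 West 22nd Street, New York, NY 12345']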

Explanation:

[0-9]{1,3}: 1 to 3 digits, the address number

(space): a space between the number and the street name

.+: street name, any character for any number of occurrences

,: a comma and a space before the city

.+: city, any character for any number of occurrences

,: a comma and a space before the state

[A-Z]{2}: exactly 2 uppercase chars from A to Z, the state

(space): a space between the state and the ZIP code

[0-9]{5}: 5 digits, the ZIP code

re.findall(expr, string) returns a list with all the occurrences found.
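
As a quick check, here is the same pattern run against the exact text from the question. It is only a sketch: the pattern assumes the "number street, city, ST 12345" layout and will miss addresses written any other way.

```python
import re

# The sample text from the question, split over two lines for readability
txt = ("Hi, Mr. Sam D. Richards lives here, 44 West 22nd Street, New York, NY 12345. "
       "Can you contact him now? If you need any help, call me on 12345678")

regexp = r"[0-9]{1,3} .+, .+, [A-Z]{2} [0-9]{5}"
address = re.findall(regexp, txt)
print(address)  # ['44 West 22nd Street, New York, NY 12345']
```

Note that .+ is greedy, so when several addresses sit on the same line a single match may stretch from the first house number to the last ZIP code. Replacing .+ with [^,]+ is one way to keep each part from running across commas.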

Valediction answered 13/6, 2016 at 8:21 Comment(4)
Deep, clear explanation. Where can I learn regular expressions in detail? - Charlot
Is there any way to detect an address in text like this using Node.js instead of Python, @Valediction? - Transmigrant
@Transmigrant it's the very same approach, just copy over the RegExp. - Valediction
I always find regex101.com very helpful. - Omega
N
6

pyap works best, not just for this particular example but also for other addresses contained in text.

import pyap

text = ...
addresses = pyap.parse(text, country='US')
Nonrestrictive answered 11/10, 2018 at 5:47 Comment(2)
For those finding this: as of mid-2022 this package hasn't been updated in 2 years. It's a regex-based approach and has the corresponding limitations. - Gable
That said, if you just want the regex logic, here's a link to the US address regex: github.com/vladimarius/pyap/blob/master/pyap/source_US/data.py - Gable
I
3

Check out libpostal, a library dedicated to address parsing and normalization.

It cannot extract addresses from raw text, but it may help in related tasks.

Ichnology answered 14/12, 2018 at 0:51 Comment(2)
Libpostal is used for normalising strings that have already been identified as addresses, which is a completely different task. - Referential
Yeah, libpostal is not really a solution for OP's question. It takes human-formatted addresses and makes them more "machine" readable. For extraction, check out LexNLP. It's not well documented, but with a few dozen lines of code it does a damn good job. Where something like libpostal could help is in finding and correcting mistakes or adding missing data like postal codes. For this, though, it's easy enough to use Google's Address Validation API, which works extremely well. Where libpostal shines is in its license: Google doesn't let you store returned data, for example. - Prentice
G
3

For US address extraction from bulk text:

For US addresses in bulk text I have had pretty good luck, though not perfect, with the regex below. It won't work on many of the odder address formats, and it only captures the first 5 digits of the ZIP code.

Explanation:

  • ([0-9]{1,5}) - a string of 1-5 digits to start off
  • (.{5,75}) - any character, 5-75 times. I looked at the addresses I was interested in, and the vast majority were over 5 and under 60 characters for address line 1, address line 2 and city combined.
  • (BIG LIST OF AMERICAN STATES AND ABBREVIATIONS) - this matches the state. It assumes full state names are in Title Case.
  • .{1,2} - designed to accommodate the common separators between the state and the ZIP code: a comma plus whitespace, or just whitespace
  • ([0-9]{5}) - captures the first 5 digits of the ZIP code.

import re

text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"

# (?<!Baja ) keeps "Baja California" from being matched as the US state
address_regex = r"([0-9]{1,5})(.{5,75})((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:(?<!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"

addresses = re.findall(address_regex, text)

addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]

You can combine these and remove spaces like so:

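cleaned = []
for address in addresses:
    out_address = " ".join(address)  # join the captured groups
    out_address = " ".join(out_address.split())  # collapse repeated whitespace
    cleaned.append(out_address)  # keep every address, not just the last one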
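
If only the full address string is needed, one alternative to joining the groups is to take the whole match with re.finditer. The sketch below is a simplified stand-in, not the regex above: it drops the capture groups and, to stay short, assumes states appear only as two-letter codes. The name STATE is introduced here for illustration.

```python
import re

# Assumption: two-letter state codes only, to keep the example short
STATE = (r"(?:A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]"
         r"|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])")

# Same overall shape as the pattern above: number, free text, state, separator, ZIP
address_regex = re.compile(r"[0-9]{1,5}.{5,75}" + STATE + r".{1,2}[0-9]{5}")

text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"

# group(0) is the entire matched span, so no joining or whitespace cleanup is needed
addresses = [m.group(0) for m in address_regex.finditer(text)]
print(addresses)  # ['175 Fox Meadow, Orchard Park, NY 14127']
```

Because nothing is captured in separate groups, this version loses the split into number, street/city, state and ZIP that re.findall returns with the original pattern.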

To then break this into a proper line 1, line 2, etc., I suggest using an address validation API such as Google's or Lob's. These can take a string and break it into parts. There are also some Python solutions for this, such as usaddress.

Gable answered 15/7, 2022 at 15:33 Comment(0)
