Methods for extracting locations from text?

What are the recommended methods for extracting locations from free text?

What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?

I can also think of keeping a lookup hash table of country and city names and then comparing every token extracted from the text against that table.
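To make the lookup idea concrete, here is a minimal sketch in Python. The `GAZETTEER` contents and the `find_locations` name are made up for illustration; a real table would be loaded from a place-name dataset.

```python
import re

# Hypothetical, tiny gazetteer; a real one would be loaded from a dataset.
GAZETTEER = {"berlin", "paris", "canada", "tokyo"}

def find_locations(text):
    """Return the tokens of `text` that appear in GAZETTEER (case-insensitive)."""
    # Simplification: a token is a run of ASCII letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    return [tok for tok in tokens if tok.lower() in GAZETTEER]

print(find_locations("Flying from Berlin to Tokyo next week"))
```

Each lookup is a constant-time set membership test, but a token-by-token comparison like this cannot match multi-word names such as "New York".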

Does anybody know of better approaches?

Edit: I'm trying to extract locations from tweet text, so the high volume of tweets might also affect my choice of method.

Borg answered 20/7, 2013 at 12:58 Comment(0)
11

All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)

This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml

You can easily find implementations in other programming languages.

Flutter answered 20/7, 2013 at 16:46 Comment(2)
I'm trying to extract locations from tweet text. Considering the high number of tweets per second, I guess that would be slow, right? - Borg
No. Training is slow and memory-consuming, but at runtime this is extremely efficient. - Flutter
1

Put all of your valid locations into a sorted list. If you are planning on case-insensitive comparison, make sure the case of your list is already normalized.

Then all you have to do is loop over individual "words" in your input text and at the start of each new word, start a new binary search in your location list. As soon as you find a no-match, you can skip the entire word and proceed with the next.

Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the next word whenever your bsearch leads you to a (possible!) multi-word result. Then, if the full comparison fails, possibly several words later, all you have to do is go back to that saved 'next' word and resume from there.

As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
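A rough sketch of this in Python, using the standard `bisect` module for the binary search. The `LOCATIONS` entries and the helper names are made up for illustration, and splitting on whitespace is a simplification of the "word" rule described above. It keeps the longest match it finds, so a longer name wins over a shorter one that it starts with.

```python
from bisect import bisect_left

# Hypothetical location list: lower-cased and sorted once, up front.
LOCATIONS = sorted(["berlin", "berlin heights", "new york",
                    "people's republic of china"])

def has_prefix(phrase):
    """True if at least one entry in LOCATIONS starts with `phrase`."""
    i = bisect_left(LOCATIONS, phrase)
    return i < len(LOCATIONS) and LOCATIONS[i].startswith(phrase)

def is_location(phrase):
    """True if `phrase` is exactly one of the entries in LOCATIONS."""
    i = bisect_left(LOCATIONS, phrase)
    return i < len(LOCATIONS) and LOCATIONS[i] == phrase

def extract(text):
    # Simplification: words are whitespace-separated; punctuation is not stripped.
    words = text.lower().split()
    found, start = [], 0
    while start < len(words):
        best_end = None
        end = start + 1
        # Grow the candidate one word at a time while it can still match.
        while end <= len(words):
            phrase = " ".join(words[start:end])
            if not has_prefix(phrase):
                break  # no entry starts with this phrase: stop extending
            if is_location(phrase):
                best_end = end  # remember the longest full match so far
            end += 1
        if best_end is None:
            start += 1  # no match here: resume at the next word
        else:
            found.append(" ".join(words[start:best_end]))
            start = best_end
    return found

print(extract("I moved from Berlin to New York"))
```

The prefix test works because, in a sorted list, the first entry that is not smaller than the phrase must start with the phrase if any entry does.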

Patch answered 20/7, 2013 at 13:22 Comment(1)
Another possible problem could be that the first part of a multi-word location may be a location in itself. "Berlin" vs. "Berlin Heights, OH", for example. - Patch
0

How fast are the tweets coming in? Is it the full Twitter firehose, or a filtered query stream? A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool that is integrated with a gazetteer. Very few NLP tools can keep up with Twitter rates, and very few do well with tweets because of all the leetspeak. The NLP step can be tuned for precision or recall depending on your needs, which limits the number of lookups performed against the gazetteer. I recommend looking at Rosoka (also available as Rosoka Cloud through Amazon AWS) and GeoGravy.

Trigonal answered 1/10, 2013 at 2:36 Comment(0)
