Find all locations / cities / places in a text

If I have a text containing, for example, a newspaper article in Catalan, how could I find all the cities mentioned in that text?

I have been looking at the nltk package for Python, and I have downloaded the corpus for Catalan (nltk.corpus.cess_cat).

I have installed everything necessary via nltk.download(). An example of what I have at this moment:

te = nltk.word_tokenize('Tots els gats son de Sant Cugat del Valles.')

nltk.pos_tag(te)

The city is 'Sant Cugat del Valles'. What I get from the output is:

[('Tots', 'NNS'),
 ('els', 'NNS'),
 ('gats', 'NNS'),
 ('son', 'VBP'),
 ('de', 'IN'),
 ('Sant', 'NNP'),
 ('Cugat', 'NNP'),
 ('del', 'NN'),
 ('Valles', 'NNP')]

NNP seems to indicate proper nouns whose first letter is uppercase. Is there a way of getting only places or cities, rather than all proper nouns? Thank you

Groupie answered 10/5, 2015 at 10:0 Comment(3)
Have you already tried anything? If so, where do you get stuck? (Leith)
Should I get all NNP tokens and treat those as the cities and places? Or should I consider other commands from nltk? I ask because NNP seems to indicate only nouns that begin with an uppercase letter. (Groupie)

You can use the geotext Python library for this.

pip install geotext

is all it takes to install this library. The usage is as simple as:

from geotext import GeoText
places = GeoText("London is a great city")
places.cities

gives the result ['London'].

The list of cities covered by this library is not exhaustive, but it is a good one.

Antique answered 17/5, 2016 at 7:36 Comment(3)
Voted this solution up since it's easy to use. But city names in a sentence have to start with an upper-case letter. (Skurnik)
The geography package is also similar, but with more advanced functionalities. (Intarsia)
The geography package linked above is no longer working, nor is the forked version (geography2). Geotext also hasn't had an update in 2+ years, so these answers aren't quite up to date. (Adelleadelpho)

You either have to train a named entity recognizer (NER) or you can make your own gazetteer.

A simple Gazetteer I have made and use for tasks like yours is this one:

# -*- coding: utf-8 -*-
import codecs
import os
import re

from nltk.chunk.util import conlltags2tree
from nltk.chunk import ChunkParserI
from nltk.tag import pos_tag
from nltk.tokenize import wordpunct_tokenize


def sub_leaves(tree, node):
    # NLTK 3 renamed Tree.node to Tree.label()
    return [t.leaves() for t in tree.subtrees(lambda s: s.label() == node)]


class Gazetteer(ChunkParserI):
    """
    Find and annotate a list of words that matches patterns.
    Patterns may be regular expressions in the form list of tuples.
    Every tuple has the regular expression and the iob tag for this one.
    Before applying gazetteer words a part of speech tagging should
    be performed. So, you have to pass your tagger as a parameter.
    Example:
        >>> patterns = [(u"Αθήνα[ς]?", "LOC"), (u"Νομική[ς]? [Σσ]χολή[ς]?", "ORG")]
        >>> gazetteer = Gazetteer(patterns, nltk.pos_tag, nltk.wordpunct_tokenize)
        >>> text = u"Η Νομική σχολή της Αθήνας"
        >>> t = gazetteer.parse(text)
        >>> print(unicode(t))
        ... (S Η/DT (ORG Νομική/NN σχολή/NN) της/DT (LOC Αθήνας/NN))
    """

    def __init__(self, patterns, pos_tagger, tokenizer):
        """
        Initialize the class.

        :param patterns:
            The patterns to search for in the text, as a list of tuples of
            (regular expression, tag to apply)
        :param pos_tagger:
            The tagger to use for applying part of speech to the text
        :param tokenizer:
            The tokenizer to use for tokenizing the text
        """
        self.patterns = patterns
        self.pos_tag = pos_tagger
        self.tokenize = tokenizer
        self.lookahead = 0  # how many extra words a gazetteer entry may span
        self.words = []  # keep the words found by applying the regular expressions
        self.iobtags = []  # for each set of words keep the corresponding tag

    def iob_tags(self, tagged_sent):
        """
        Search the tagged sentences for gazetteer words and apply their iob tags.

        :param tagged_sent:
            A tokenized text with part of speech tags
        :type tagged_sent: list
        :return:
            yields each word with its IOB tag, e.g. B-LOCATION
        :rtype:
        """
        i = 0
        l = len(tagged_sent)
        inside = False  # marks the I- tag
        iobs = []

        while i < l:
            word, pos_tag = tagged_sent[i]
            j = i + 1  # the next word
            k = j + self.lookahead  # how many words in a row we may search
            nextwords, nexttags = [], []  # for now, just the ith word
            add_tag = False  # no tag, this is O

            while j <= k:
                words = ' '.join([word] + nextwords)  # expand our word list
                if words in self.words:  # search for words
                    index = self.words.index(words)  # keep index to use for iob tags
                    if inside:
                        iobs.append((word, pos_tag, 'I-' + self.iobtags[index]))  # use the index tag
                    else:
                        iobs.append((word, pos_tag, 'B-' + self.iobtags[index]))

                    for nword, ntag in zip(nextwords, nexttags):  # there was more than one word
                        iobs.append((nword, ntag, 'I-' + self.iobtags[index]))  # apply I- tag to all of them

                    add_tag, inside = True, True
                    i = j  # skip tagged words
                    break

                if j < l:  # we haven't reached the end of the tagged sentence
                    nextword, nexttag = tagged_sent[j]  # get the next word and its tag
                    nextwords.append(nextword)
                    nexttags.append(nexttag)
                    j += 1
                else:
                    break

            if not add_tag:  # unknown words
                inside = False
                i += 1
                iobs.append((word, pos_tag, 'O'))  # it's an Outsider

        return iobs

    def parse(self, text, conlltags=True):
        """
        Given a text, applies tokenization, part-of-speech tagging and the
        gazetteer words with their tags. Returns a CoNLL tree.

        :param text: The text to parse
        :type text: str
        :param conlltags:
        :type conlltags:
        :return: A CoNLL tree
        :rtype:
        """
        # apply the regular expressions and find all the
        # gazetteer words in text
        for pattern, tag in self.patterns:
            words_found = set(re.findall(pattern, text))  # keep the unique words
            if len(words_found) > 0:
                for word in words_found:  # words_found may be more than one
                    self.words.append(word)  # keep the words
                    self.iobtags.append(tag)  # and their tag

        # find the pattern with the maximum words.
        # this will be the look ahead variable
        for word in self.words:  # don't care about tags now
            nwords = word.count(' ')
            if nwords > self.lookahead:
                self.lookahead = nwords

        # tokenize and apply part of speech tagging
        tagged_sent = self.pos_tag(self.tokenize(text))
        # find the iob tags
        iobs = self.iob_tags(tagged_sent)

        if conlltags:
            return conlltags2tree(iobs)
        else:
            return iobs


if __name__ == "__main__":
    patterns = [(u"Αθήνα[ς]?", "LOC"), (u"Νομική[ς]? [Σσ]χολή[ς]?", "ORG")]
    g = Gazetteer(patterns, pos_tag, wordpunct_tokenize)
    text = u"Η Νομική σχολή της Αθήνας"
    t = g.parse(text)
    print(t)


    dir_with_lists = "Lists"
    patterns = []
    tags = []
    for root, dirs, files in os.walk(dir_with_lists):
        for f in files:
            lines = codecs.open(os.path.join(root, f), 'r', 'utf-8').readlines()
            tag = os.path.splitext(f)[0]
            for l in lines[1:]:
                patterns.append((l.rstrip(), tag))
                tags.append(tag)

    text = codecs.open("sample.txt", 'r', "utf-8").read()
    g = Gazetteer(patterns, pos_tag, wordpunct_tokenize)
    t = g.parse(text.lower())
    print(t)

    for tag in set(tags):
        for gaz_word in sub_leaves(t, tag):
            print(gaz_word[0][0], tag)

In the if __name__ == "__main__": block you can see an example where I build the patterns in code: patterns = [(u"Αθήνα[ς]?", "LOC"), (u"Νομική[ς]? [Σσ]χολή[ς]?", "ORG")].

Later in the code I read files from a directory named Lists (put it in the folder where you have the above code). The name of each file becomes the gazetteer's tag. So, make files like LOC.txt with patterns for locations (the LOC tag), PERSON.txt for persons, etc.
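The Lists-directory convention can be sketched as a small standalone helper (load_patterns is a hypothetical name; unlike the loop above, this simple version reads every line rather than skipping a header line):

```python
# Hypothetical helper mirroring the Lists-directory idea:
# each file name (e.g. LOC.txt) becomes the tag for the patterns it contains.
import os


def load_patterns(dir_with_lists):
    patterns = []
    for fname in sorted(os.listdir(dir_with_lists)):
        tag = os.path.splitext(fname)[0]  # "LOC.txt" -> "LOC"
        with open(os.path.join(dir_with_lists, fname), encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    patterns.append((line, tag))
    return patterns
```

The resulting list can then be passed straight to the Gazetteer constructor.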

Lucrative answered 10/5, 2015 at 11:50 Comment(0)

You don't need to use NLTK for this. Instead, do the following:

  1. Split the text into a list of all its words.
  2. Put the cities into a dictionary, e.g. {"Sant Cugat del Valles": ["Sant", "Cugat", "del", "Valles"]}. It should be easy to find a list of all the cities in the area somewhere online or from your local government.
  3. Iterate over the words of the text.

    3.1. For each city, if the city's first word matches the current word in the text, check the next word.

Here is a runnable code example:

text = 'Tots els gats son de Sant Cugat del Valles.'
# Prepare your text: remove "." (and other unnecessary marks),
# then split it into a list of words.
text = text.replace('.', '').split(' ')

# Insert the cities you want to search for.
cities = {"Sant Cugat del Valles": ["Sant", "Cugat", "del", "Valles"]}

found_match = False
cityTest = ''
for word in text:
    if not found_match:
        cityTest = ''  # start over unless the previous word matched
    found_match = False
    for city in cities.keys():
        if word in cities[city]:
            cityTest += word + ' '
            found_match = True
        if cityTest.split(' ')[0:-1] == city.split(' '):
            print(city)  # print if it found a city
Dulcinea answered 10/5, 2015 at 11:25 Comment(1)
@Dulcinea I really found your answer helpful; however, I made one minor change to the code you provided: after printing the city, I reset the variable with found_match = False. Hope this helps others! (Tsimshian)

There’s a standard Linux program, fgrep, that does this. Give it a file with a list of cities, one per line, and a second file to search (or stdin), and it prints each line of the second file that contains any of the cities. There are switches to print only the matching text (just the city), do case-insensitive matches, etc.

You can call fgrep directly from Python.
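A minimal sketch of that, assuming grep is available on the PATH (fgrep is equivalent to grep -F) and a hypothetical cities.txt file with one city name per line:

```python
# Sketch: run grep -F (fgrep) from Python and collect the matched city names.
import subprocess


def find_cities(cities_file, text_file):
    # -F: fixed strings, -o: print only the matching text, -f: patterns from file
    result = subprocess.run(
        ["grep", "-F", "-o", "-f", cities_file, text_file],
        capture_output=True, text=True
    )
    return result.stdout.splitlines()
```

Each element of the returned list is one occurrence of a city from cities.txt found in the text.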

Matejka answered 5/8, 2023 at 23:27 Comment(0)
