Extracting nationalities and countries from text
Asked Answered
C

4

14

I want to extract all country and nationality mentions from text using NLTK. I used POS tagging to extract all GPE-labeled tokens, but the results were not satisfying.

 abstract="Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO. Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "

  tokens = nltk.tokenize.wordpunct_tokenize(abstract)
  tagged = nltk.pos_tag(tokens)
  nes = nltk.ne_chunk(tagged)
  places = []
  for ne in nes:
      if isinstance(ne, nltk.tree.Tree) and ne.label() == 'GPE':
          places.append(u' '.join(i[0] for i in ne.leaves()))
  if not places:
      places.append("N/A")

The results obtained are:

['Thyroid', 'Australian', 'Caucasian', 'Graves']

Some are nationalities but others are just nouns.

So what am I doing wrong, or is there another way to extract such information?

Chokedamp answered 17/6, 2016 at 16:44 Comment(2)
There's nothing wrong with what you did. You performed entity extraction and then searched the entity chunks for the GPE label. The reason you are not happy with the results is that NLTK generally has poor performance at classifying entities. There are lookup tables available for GPEs; they are pretty comprehensive and very efficient. Use them instead of relying on NLTK.Effortful
Thank you, can you give me an example of those lookup tables...Chokedamp
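The lookup-table approach suggested in the comment above can be sketched in plain Python. The word sets below are tiny illustrative samples, not a real gazetteer; a comprehensive table would contain every country and demonym:

```python
# Minimal gazetteer sketch: these sets are tiny illustrative samples,
# not a comprehensive lookup table.
COUNTRIES = {"australia", "france", "india"}
NATIONALITIES = {"australian", "french", "indian", "caucasian"}

def find_mentions(text):
    """Return country/nationality mentions found by simple case-insensitive lookup."""
    hits = []
    for token in text.replace(".", " ").replace(",", " ").split():
        if token.lower() in COUNTRIES or token.lower() in NATIONALITIES:
            hits.append(token)
    return hits

print(find_mentions("An Australian Caucasian discovery cohort from Australia."))
# ['Australian', 'Caucasian', 'Australia']
```

The upside of this approach is that it never mislabels ordinary nouns the way a statistical tagger can; the downside is that it only finds exact matches present in the table.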
C
7

So after the fruitful comments, I dug deeper into different NER tools to find the best at recognizing nationality and country mentions, and found that spaCy has a NORP entity type that extracts nationalities efficiently. https://spacy.io/docs/usage/entity-recognition

Chokedamp answered 22/6, 2016 at 12:40 Comment(2)
spaCy is fantastic and really powerful. I also recommend fooling around with the Alchemy API as well. Though for large data it's preferable to use spaCy, as it does not impose a transaction cost for every query and result.Effortful
As we know, spaCy will tag locations as GPE. In my case I have two locations marked as GPE (e.g. India, Delhi). Now my goal is to identify which one is the city and which is the country. Please comment @RenaudHoskinson
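Regarding the last comment: spaCy's GPE label does not distinguish cities from countries, so one option is a post-hoc lookup against a country list. The set below is a tiny illustrative sample; a library such as pycountry would give full coverage:

```python
# Tiny illustrative country list; not exhaustive.
COUNTRIES = {"india", "australia", "france"}

def classify_gpe(entity):
    """Classify a GPE string as 'country' or 'city/other' by lookup."""
    return "country" if entity.lower() in COUNTRIES else "city/other"

print(classify_gpe("India"))  # country
print(classify_gpe("Delhi"))  # city/other
```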
E
4

Here's geograpy, which uses NLTK to perform entity extraction. It stores all places and locations in a gazetteer, then performs a lookup on the gazetteer to fetch the relevant places and locations. See the docs for more usage details:

from geograpy import extraction

e = extraction.Extractor(text="Thyroid-associated orbitopathy (TO) is an "
    "autoimmune-mediated orbital inflammation that can lead to disfigurement "
    "and blindness. Multiple genetic loci have been associated with Graves' "
    "disease, but the genetic basis for TO is largely unknown. This study aimed "
    "to identify loci associated with TO in individuals with Graves' disease, "
    "using a genome-wide association scan (GWAS) for the first time to our "
    "knowledge in TO. Genome-wide association scan was performed on pooled DNA "
    "from an Australian Caucasian discovery cohort of 265 participants with "
    "Graves' disease and TO (cases) and 147 patients with Graves' disease "
    "without TO (controls).")

e.find_entities()
print(e.places)
Effortful answered 21/6, 2016 at 11:13 Comment(6)
I actually tried to install geograpy but failed... this is why I relied on NLTK.Chokedamp
Same issue with me, couldn't install geograpy :(Grasmere
Please install NLTK before you install geograpy, or you can do pip install geograpy-nltkEffortful
For geograpy, this worked for me: #31173219Awestricken
@OwaisKureshi pip install --upgrade html5lib==1.0b8 and then install geograpyVittorio
old but for python3 use - pip3 install geograpy3Triviality
A
3

If you want the country names to be extracted, what you need is a NER tagger, not a POS tagger.

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify elements in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Check out Stanford NER tagger!

from nltk.tag.stanford import StanfordNERTagger

# Point these paths at your local Stanford NER model and jar
st = StanfordNERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
tagging = st.tag(text.split())
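The tagger returns (token, label) pairs, so consecutive LOCATION tokens still need to be merged into place names. A small helper can do that; the sample tagged output below is illustrative, shaped like what Stanford NER produces:

```python
def extract_locations(tagged):
    """Merge runs of consecutive LOCATION-tagged tokens into place names."""
    places, current = [], []
    for token, label in tagged:
        if label == "LOCATION":
            current.append(token)
        elif current:
            places.append(" ".join(current))
            current = []
    if current:  # flush a run that ends at the last token
        places.append(" ".join(current))
    return places

# Illustrative sample shaped like Stanford NER's (token, label) output
sample = [("Cohort", "O"), ("from", "O"), ("New", "LOCATION"),
          ("South", "LOCATION"), ("Wales", "LOCATION"), (",", "O"),
          ("Australia", "LOCATION")]
print(extract_locations(sample))  # ['New South Wales', 'Australia']
```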
Aeneus answered 18/6, 2016 at 7:12 Comment(2)
He's already performed entity extraction! Unknowingly, perhaps.Effortful
Your answer just gives him a list of classified words. You do not even provide him with a list of GPEs. Please edit your answer.Effortful
F
1

You can use spaCy for NER. It gives better results than NLTK.

import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(u"Apple is opening its first big office in San Francisco and California.")
print([(ent.text, ent.label_) for ent in doc.ents])
Fala answered 22/5, 2019 at 5:38 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.