How to speed up NE recognition with Stanford NER in Python NLTK

First I tokenize the file content into sentences and then call Stanford NER on each sentence, but this process is really slow. I know it would be faster if I called it on the whole file content, but I'm calling it on each sentence because I want to index each sentence before and after NE recognition.

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    sentences = sent_tokenize(filecontent)  # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent)  # tokenize the sentence into words
        ne_tags = st.tag(words)  # get tagged NEs from Stanford NER (one tagger call per sentence)

This is probably because st.tag() is called separately for each sentence, but is there any way to make it run faster?

EDIT

The reason I want to tag sentences separately is that I want to write the sentences to a file (like a sentence index), so that given the NE-tagged sentence at a later stage, I can get the unprocessed sentence (I'm also lemmatizing here).

file format:

(sent_number, orig_sentence, NE_and_lemmatized_sentence)
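
For concreteness, a minimal sketch of writing rows in that format with the standard csv module (sentence_index.csv and the process_sentence() helper are hypothetical stand-ins for the actual NE tagging and lemmatizing; sentences is the list produced by sent_tokenize above):

import csv

def process_sentence(sentence):
    # placeholder for the actual NE tagging + lemmatizing step
    return sentence

with open('sentence_index.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for sent_number, orig_sentence in enumerate(sentences):
        writer.writerow((sent_number, orig_sentence, process_sentence(orig_sentence)))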

Jame answered 17/11, 2015 at 3:17 Comment(0)

StanfordNERTagger has a tag_sents() function; see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68

>>> st = StanfordNERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]  # one token list per sentence
>>> st.tag_sents(tokenized_sents)  # a single Java call tags every sentence
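
To also get the (sent_no, orig_sent, tagged_sent) triples the question asks for, one option (a sketch, assuming filecontent holds the text of a single file) is to batch all sentences through tag_sents() in one call and then zip the tags back onto the original sentences:

from nltk import sent_tokenize, word_tokenize
from nltk.tag import StanfordNERTagger

st = StanfordNERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner/stanford-ner.jar')

sentences = sent_tokenize(filecontent)                        # original sentences, in order
tagged = st.tag_sents([word_tokenize(s) for s in sentences])  # one Java call for all sentences

# tag_sents preserves sentence order, so the results line up with the originals
for sent_no, (orig_sent, tagged_sent) in enumerate(zip(sentences, tagged)):
    print(sent_no, orig_sent, tagged_sent)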
Tao answered 17/11, 2015 at 4:24 Comment(5)
See also #33677026 – Tao
I don't think this would do what I want, as it only outputs tagged sentences. What I want is pairs of the original sentence and the NE-tagged sentence, which I will write to a file in the format below: – Jame
(sent_no, orig_sent, tagged_sent), e.g. 0, a new doc to the royal womens hospital, a new doc to the royal_womens_hospital. I don't think your answer allows me to do that? – Jame
Just iterate through the filecontent, which contains only the input sentences; do some data cleaning before that and the code will work. Otherwise, can you post a sample of your input file? The file format alone doesn't help; a sample would help us understand the question better =) – Tao
I just found a strange case where this doesn't work when processing my data. Please have a look at this post: #33755592 – Jame

You can use the Stanford NER server; it will be much faster.

Install sner:

pip install sner

Run the NER server:

cd your_stanford_ner_dir
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz

from sner import Ner

test_string = "Alice went to the Museum of Natural History."
tagger = Ner(host='localhost', port=9199)  # connect to the NER server started above
print(tagger.get_entities(test_string))

The result of this code is:

[('Alice', 'PERSON'),
 ('went', 'O'),
 ('to', 'O'),
 ('the', 'O'),
 ('Museum', 'ORGANIZATION'),
 ('of', 'ORGANIZATION'),
 ('Natural', 'ORGANIZATION'),
 ('History', 'ORGANIZATION'),
 ('.', 'O')]

For more detail, see https://github.com/caihaoyu/sner
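
Since the server keeps the classifier loaded, calling it once per sentence stays fast. A sketch of plugging it into the per-sentence loop from the question (assuming filecontent and the NLTK sent_tokenize from the question):

from nltk import sent_tokenize
from sner import Ner

tagger = Ner(host='localhost', port=9199)  # talks to the running NERServer

for sent_no, sent in enumerate(sent_tokenize(filecontent)):
    tagged_sent = tagger.get_entities(sent)  # list of (token, tag) pairs
    print(sent_no, sent, tagged_sent)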

Prakrit answered 2/4, 2017 at 7:31 Comment(0)

First download Stanford CoreNLP 3.5.2 from here: http://nlp.stanford.edu/software/corenlp.shtml

Let's say you put the download at /Users/username/stanford-corenlp-full-2015-04-20

This Python code will run the pipeline:

import os

stanford_distribution_dir = "/Users/username/stanford-corenlp-full-2015-04-20"
list_of_sentences_path = "/Users/username/list_of_sentences.txt"
stanford_command = "cd %s ; java -Xmx2g -cp \"*\" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner -ssplit.eolonly -filelist %s -outputFormat json" % (stanford_distribution_dir, list_of_sentences_path)
os.system(stanford_command)

Here is some sample Python code for loading one of the generated .json files:

import json

with open("sample_file.txt.json") as f:
    sample_json = json.load(f)

At this point sample_json will be a nice dictionary with all the sentences from the file in it.

for sentence in sample_json["sentences"]:
  tokens = []
  ner_tags = []
  for token in sentence["tokens"]:
    tokens.append(token["word"])
    ner_tags.append(token["ner"])
  print (tokens, ner_tags)

list_of_sentences.txt should be your list of files with sentences, something like:

input_file_1.txt
input_file_2.txt
...
input_file_100.txt

So input_file.txt (which should have one sentence per line) will generate input_file.txt.json once the Java command is run, and that .json file will contain the NER tags. You can just load the .json for each input file and easily get (sentence, NER tag sequence) pairs. You can experiment with "text" as an alternative output format if you like it better, but "json" creates a .json file that you can load with json.load(...), and then you'll have a dictionary you can use to access the sentences and annotations.

This way you'll only load the pipeline once for all the files.
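
A sketch of how the one-sentence-per-line input files and list_of_sentences.txt could be produced from the question's filelist before running the Java command above (the .sents.txt naming is just an illustration):

from nltk import sent_tokenize

sentence_file_paths = []
for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    sent_path = filename + ".sents.txt"          # one sentence per line
    with open(sent_path, "w") as out:
        out.write("\n".join(sent_tokenize(filecontent)))
    sentence_file_paths.append(sent_path)

with open(list_of_sentences_path, "w") as out:   # the path used in the command above
    out.write("\n".join(sentence_file_paths))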

Creator answered 17/11, 2015 at 6:7 Comment(0)

After attempting several options, I like Stanza. It is developed by Stanford, is very simple to implement, I didn't have to figure out how to start the server properly on my own, and it dramatically improved the speed of my program. It supports 18 different entity classes.

I found Stanza as I was searching through the documentation.

To install: pip install stanza

then in Python:

import stanza
stanza.download('en') # download English model
nlp = stanza.Pipeline('en') # initialize English neural pipeline
doc = nlp("My name is John Doe.") # run annotation over a sentence or multiple sentences

If you only want a specific tool (NER), you can specify it with processors: nlp = stanza.Pipeline('en', processors='tokenize,ner')

For an output similar to that produced by the OP:

classified_text = [(token.text,token.ner) for i, sentence in enumerate(doc.sentences) for token in sentence.tokens]
print(classified_text)
[('My', 'O'), ('name', 'O'), ('is', 'O'), ('John', 'B-PERSON'), ('Doe', 'E-PERSON')]

But to produce a list of only the words that were recognized as entities:

classified_text = [(ent.text,ent.type) for ent in doc.ents]
[('John Doe', 'PERSON')]

It has a couple of features that I really like:

  1. You can access each sentence with doc.sentences (see the sketch after this list).
  2. Instead of each word being classified as a separate person entity, it combines John Doe into one 'PERSON' object.
  3. If you do want each separate word, you can extract those, and it identifies which part of the entity each word is ('B' for the first word, 'I' for intermediate words, and 'E' for the last word).
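
A sketch of reproducing the question's (sent_no, orig_sent, tagged_sent) layout with Stanza, using doc.sentences (filecontent is assumed to be the raw text of one file, as in the question):

import stanza

nlp = stanza.Pipeline('en', processors='tokenize,ner')
doc = nlp(filecontent)

for sent_no, sentence in enumerate(doc.sentences):
    orig_sent = sentence.text                                          # the untagged sentence
    tagged_sent = [(token.text, token.ner) for token in sentence.tokens]
    print(sent_no, orig_sent, tagged_sent)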
Rashad answered 2/7, 2021 at 18:27 Comment(0)
