Extracting names from a text file using Spacy

Asked 24/7, 2018 at 4:46 Answered 17/4, 2022 at 17:44

python nlp nltk spacy named-entity-recognition

I have a text file which contains lines as shown below:

Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST

The patient was referred by Dr. Jacob Austin.  

Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST

Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST

The patient was referred by
Dr. Jayden Green Olivia.

I want to extract all the names using spaCy. I am using spaCy's part-of-speech tagging and entity recognition, but without success. How could this be done? Any help would be appreciated.

I am using some code in this way:

import spacy
nlp = spacy.load('en')
document_string= """ Electronically signed by stupid: Dr. John Douglas, M.D.; 
    Jun 13 2018 11:13AM CST"""
doc = nlp(document_string)
for sentence in doc.ents:
    print(sentence, sentence.label_)

Ununa asked 24/7, 2018 at 4:46 Comment(3)

Show us your code. Show examples where spacy is giving bad prediction – Croquet 24/7, 2018 at 8:56

@PradipPramanick this is my code: import spacy nlp = spacy.load('en') document_string= " Electronically signed by stupid: Dr. John Douglas, M.D.; Jun 13 2018 11:13AM CST" doc = nlp(document_string) for sentence in doc.ents: print(sentence, sentence.label_) – Ununa 24/7, 2018 at 14:4

Please put your code in your answer so it has line breaks and everything. – Wald 25/7, 2018 at 2:22

The issue with model accuracy

No NER model is 100% accurate, and in this case even a bigger model does not stop dates from being tagged as people. The reported accuracy values (F-score, precision, recall) for the NER models are all around 86%.

document_string = """ 
Electronically signed : Wes Scott, M.D.; Jun 26 2010 11:10AM CST 
 The patient was referred by Dr. Jacob Austin.   
Electronically signed by Robert Clowson, M.D.; Janury 15 2015 11:13AM CST 
Electronically signed by Dr. John Douglas, M.D.; Jun 16 2017 11:13AM CST 
The patient was referred by 
Dr. Jayden Green Olivia.   
"""

With the small model, two date items are labelled as 'PERSON':

import spacy

# 'en' is the spaCy 2.x shortcut link for the small English model
nlp = spacy.load('en')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
# [Wes Scott,
#  Jun 26,
#  Jacob Austin,
#  Robert Clowson,
#  John Douglas,
#  Jun 16 2017,
#  Jayden Green Olivia]

With the larger model en_core_web_md, precision is even worse: there are three misclassified entities.

nlp = spacy.load('en_core_web_md')
sents = nlp(document_string)
[ee for ee in sents.ents if ee.label_ == 'PERSON']
# Out:
#[Wes Scott,
# Jun 26,
# Jacob Austin,
# Robert Clowson,
# Janury,
# John Douglas,
# Jun 16 2017,
# Jayden Green Olivia]
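
To make the precision comparison concrete, here is a small sketch that scores the two outputs listed above. It assumes the gold set is the five people in the sample text, compared without titles; the helper names precision and recall are illustrative only.

```python
# Gold names: assumed to be the five people in the sample text (titles excluded)
gold = {"Wes Scott", "Jacob Austin", "Robert Clowson",
        "John Douglas", "Jayden Green Olivia"}

# PERSON entities copied from the two outputs above
small_model = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
               "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]
medium_model = ["Wes Scott", "Jun 26", "Jacob Austin", "Robert Clowson",
                "Janury", "John Douglas", "Jun 16 2017", "Jayden Green Olivia"]


def precision(predicted, gold):
    """Fraction of predicted entities that are real names."""
    correct = sum(1 for p in predicted if p in gold)
    return correct / len(predicted)


def recall(predicted, gold):
    """Fraction of real names that were predicted."""
    return len(gold & set(predicted)) / len(gold)


for name, preds in (("small", small_model), ("medium", medium_model)):
    print(f"{name}: precision={precision(preds, gold):.3f} "
          f"recall={recall(preds, gold):.3f}")
# small: precision=0.714 recall=1.000
# medium: precision=0.625 recall=1.000
```

Both models find every name, so the whole difference is in the false positives, and those are date fragments.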

I also tried other models (xx_ent_wiki_sm, en_core_web_md), and they don't bring any improvement either.

What about using rules to improve accuracy?

In this small example, not only does the document seem to have a clear structure, but the misclassified entities are all dates. So why not combine the initial model with a rule-based component?

The good news is that in spaCy:

it's possible to combine statistical and rule-based components in a variety of ways. Rule-based components can be used to improve the accuracy of statistical models

(from https://spacy.io/usage/rule-based-matching#models-rules)

So, following that example and using the dateparser library (a parser for human-readable dates), I've put together a rule-based component that works very well on this example:

from spacy.tokens import Span
import dateparser

def expand_person_entities(doc):
    titles = ("Dr", "Dr.", "Mr", "Mr.", "Ms", "Ms.")
    new_ents = []
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            # Keep non-person entities unchanged
            new_ents.append(ent)
        elif ent.start != 0 and doc[ent.start - 1].text in titles:
            # If the person is preceded by a title, include the title in the entity
            new_ents.append(Span(doc, ent.start - 1, ent.end, label=ent.label))
        elif dateparser.parse(ent.text) is None:
            # Keep the entity only if it can't be parsed as a date
            # (this also covers a person at the very start of the document)
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

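# Add the component after the named entity recognizer (spaCy 2.x API).
# In spaCy 3.x, decorate the function with @Language.component("expand_person_entities")
# and pass the name instead: nlp.add_pipe("expand_person_entities", after="ner")
# When re-running in a notebook, remove the old component first:
# nlp.remove_pipe('expand_person_entities')
nlp.add_pipe(expand_person_entities, after='ner')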

doc = nlp(document_string)
[(ent.text, ent.label_) for ent in doc.ents if ent.label_=='PERSON']
# Out:
# [('Wes Scott', 'PERSON'),
#  ('Dr. Jacob Austin', 'PERSON'),
#  ('Robert Clowson', 'PERSON'),
#  ('Dr. John Douglas', 'PERSON'),
#  ('Dr. Jayden Green Olivia', 'PERSON')]
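
If pulling in dateparser is not an option, the date check could be approximated with the standard library. The sketch below is a stand-in under stated assumptions, not the method used above: looks_like_date and DATE_FORMATS are hypothetical names, only the "Mon DD" and "Mon DD YYYY" shapes seen in the sample are covered, and English month names (the default C locale) are assumed.

```python
from datetime import datetime

# Formats chosen to cover the dates in the sample text only (an assumption)
DATE_FORMATS = ("%b %d %Y", "%B %d %Y")


def looks_like_date(text):
    """Return True if text parses as 'Mon DD' or 'Mon DD YYYY'."""
    cleaned = " ".join(text.replace(",", " ").split())
    # Try the text as-is, then with a placeholder leap year appended so that
    # year-less strings such as "Jun 26" can be parsed as well
    for candidate in (cleaned, cleaned + " 2000"):
        for fmt in DATE_FORMATS:
            try:
                datetime.strptime(candidate, fmt)
                return True
            except ValueError:
                pass
    return False


for sample in ("Jun 26", "Jun 16 2017", "Wes Scott"):
    print(sample, looks_like_date(sample))
```

In the component, the condition dateparser.parse(ent.text) is None would then become not looks_like_date(ent.text). Strict format matching is a trade-off: it avoids an extra dependency, but a misspelling such as "Janury" is not recognised as a date and would still need its own rule.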

Bosomed answered 23/6, 2019 at 20:16 Comment(1)

Thanks so much for this, helped me! For newer versions of spacy (3.0.5) you need to add @Language.component("my_pipeline_component") before the function definition and then add it to the pipe like so: nlp.add_pipe("my_pipeline_component", after='ner') – Gerund 5/4, 2021 at 23:38

Try this:

import spacy
en = spacy.load('en')

sents = en(open('input.txt').read())
people = [ee for ee in sents.ents if ee.label_ == 'PERSON']

Wald answered 24/7, 2018 at 6:11 Comment(2)

@poim23 I have already tried this but it is also including Jun 26 as PERSON – Ununa 24/7, 2018 at 14:7

Try using a larger model. – Wald 25/7, 2018 at 2:21

Try this. It works fine for me; I'm using Jupyter.

import spacy

text = open('input.txt').read()

nlp = spacy.load("en_core_web_lg")
sents = nlp(text).to_json()

people = [ee for ee in sents['ents'] if ee['label'] == 'PERSON']
print(people)
for pps in people:
    # 'start' and 'end' are character offsets into the original text
    print(text[pps['start']:pps['end']])

Roer answered 17/4, 2022 at 17:44 Comment(0)
