Evaluation in a Spacy NER model

Asked 29/6, 2017 at 14:27 Answered 20/1, 2023 at 0:49

I am trying to evaluate a trained NER model created with the spaCy library. For this kind of problem you would normally use the F1 score (the harmonic mean of precision and recall). I could not find a function in the documentation for computing the accuracy of a trained NER model.

I am not sure if it's correct, but I am trying to do it in the following way (example), using f1_score from sklearn:

from sklearn.metrics import f1_score
import spacy
from spacy.gold import GoldParse


nlp = spacy.load("en") #load NER model
test_text = "my name is John" # text to test accuracy
doc_to_test = nlp(test_text) # transform the text to spacy doc format

# we create a golden doc where we know the tagged entity for the text to be tested
doc_gold_text= nlp.make_doc(test_text)
entity_offsets_of_gold_text = [(11, 15,"PERSON")]
gold = GoldParse(doc_gold_text, entities=entity_offsets_of_gold_text)

# bring the data in a format acceptable for sklearn f1 function
y_true = ["PERSON" if "PERSON" in x else 'O' for x in gold.ner]
y_predicted = [x.ent_type_ if x.ent_type_ !='' else 'O' for x in doc_to_test]
f1_score(y_true, y_predicted, average='macro')
> 1.0
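
For reference, the comparison above is made per token, not per entity. A minimal sketch of the same token-level computation without spaCy or sklearn follows. It assumes plain whitespace tokenization (spaCy's tokenizer is more elaborate), and token_labels and macro_f1 are hypothetical helper names, not library functions.

```python
def token_labels(text, entities):
    """Label each whitespace token with the entity type covering it, else 'O'."""
    labels, pos = [], 0
    for tok in text.split():
        start = text.index(tok, pos)  # character offset of this token
        end = start + len(tok)
        pos = end
        label = next((lab for s, e, lab in entities if s <= start and end <= e), "O")
        labels.append(label)
    return labels


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    scores = []
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


text = "my name is John"
y_true = token_labels(text, [(11, 15, "PERSON")])
y_pred = token_labels(text, [(11, 15, "PERSON")])  # hypothetical perfect prediction
score = macro_f1(y_true, y_pred)
print(y_true, score)
```

With a single short sentence and a correct prediction every class scores perfectly, so one example says little about the model; a larger held-out set is needed for a meaningful number.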

Any thoughts or insights would be useful.

Claptrap answered 29/6, 2017 at 14:27 Comment(1)

dulaj.medium.com/… : check this link, good article to read for spacy evaluation. – Lenticular 6/4, 2022 at 7:27

You can find different metrics including F-score, recall and precision in spaCy/scorer.py.

This example shows how you can use it:

import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(ner_model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = ner_model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot)
        pred_value = ner_model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

# example run

examples = [
    ('Who is Shaka Khan?',
     [(7, 17, 'PERSON')]),
    ('I like London and Berlin.',
     [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

ner_model = spacy.load(ner_model_path) # for spaCy's pretrained use 'en_core_web_sm'
results = evaluate(ner_model, examples)

scorer.scores contains several metrics. Running the example gives the result below. Note that the entity scores are low because the examples label London and Berlin as 'LOC' while the model labels them as 'GPE'; you can see this in ents_per_type.

{'uas': 0.0, 'las': 0.0, 'las_per_type': {'attr': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'root': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'compound': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'nsubj': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'dobj': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'cc': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'conj': {'p': 0.0, 'r': 0.0, 'f': 0.0}}, 'ents_p': 33.33333333333333, 'ents_r': 33.33333333333333, 'ents_f': 33.33333333333333, 'ents_per_type': {'PERSON': {'p': 100.0, 'r': 100.0, 'f': 100.0}, 'LOC': {'p': 0.0, 'r': 0.0, 'f': 0.0}, 'GPE': {'p': 0.0, 'r': 0.0, 'f': 0.0}}, 'tags_acc': 0.0, 'token_acc': 100.0, 'textcat_score': 0.0, 'textcats_per_cat': {}}
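
As a check on the entity scores in the output above, the following sketch recomputes them by hand with exact span matching. The predicted spans are an assumption chosen to match the behaviour described (Shaka Khan as PERSON, London and Berlin as GPE), and prf is a hypothetical helper, not spaCy's implementation.

```python
def prf(tp, fp, fn):
    """Return precision, recall and F1 as percentages (0.0 when undefined)."""
    p = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    r = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Each span is (example index, start, end, label).
gold = {(0, 7, 17, "PERSON"), (1, 7, 13, "LOC"), (1, 18, 24, "LOC")}
pred = {(0, 7, 17, "PERSON"), (1, 7, 13, "GPE"), (1, 18, 24, "GPE")}  # assumed

tp = len(gold & pred)  # spans matching in offsets and label
fp = len(pred - gold)  # predicted spans with no exact gold match
fn = len(gold - pred)  # gold spans the model did not reproduce
overall = prf(tp, fp, fn)

per_type = {}
for label in sorted({span[3] for span in gold | pred}):
    g_spans = {s for s in gold if s[3] == label}
    p_spans = {s for s in pred if s[3] == label}
    per_type[label] = prf(len(g_spans & p_spans),
                          len(p_spans - g_spans),
                          len(g_spans - p_spans))

print(overall)
print(per_type)
```

One true positive out of three predicted and three gold entities gives a precision, recall and F-score of one third each, which is where the 33.33 comes from. A span with the right offsets but the wrong label counts as both a false positive and a false negative.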

The example is taken from a spaCy example on GitHub (the link no longer works). It was last tested with spaCy 2.2.4.

Claptrap answered 30/6, 2017 at 7:59 Comment(11)

1. Your GitHub link is broken. 2. What is self in this context? Where can I find self.make_gold? – Snaky 14/2, 2018 at 22:38

@Snaky I have updated the answer to be more clear. – Claptrap 15/2, 2018 at 10:25

Can you confirm that this is still working (for v2)? I get a KeyError at gold = GoldParse(doc_gold_text, entities=annot) – Broadax 18/3, 2018 at 14:35

I get an error at this point as well. gold = GoldParse(doc_gold_text, entities=_) File "gold.pyx", line 418, in spacy.gold.GoldParse.__init__ KeyError: 0 – Insurrectionary 3/5, 2018 at 12:4

I looked on the site and it looks like this has changed with the new version. spacy.io/usage/v2 . I haven't found a solution yet. – Insurrectionary 3/5, 2018 at 12:28

Remember to import the scorer class from spacy.scorer import Scorer – Arsenical 31/5, 2018 at 18:59

@EvanLalo make sure that annot is an iterable of tuples, not a dictionary. I ran into the same issue. – Arsenical 31/5, 2018 at 20:39

Try this entities=annot['entities'] instead of the default entities=annot. – Sibling 24/10, 2018 at 6:45

How do you compute PRF metrics for each category ('PERSON', 'GPE', 'LOC', etc.) separately? – Sawfly 16/12, 2018 at 23:51

@ArnoldKlein In spacy v2.1.5, it now supports PRF per entity type. spacy.io/api/scorer – Leverage 13/8, 2019 at 20:25

for spacy v3: Execute the evaluate command on cli as mentioned in spacy.io/api/cli#evaluate – Marela 24/2, 2022 at 10:35

Since I faced the same problem, I am posting the code for the example shown in the accepted answer, adapted for spaCy v3:

import spacy
from spacy.scorer import Scorer
from spacy.training import Example

examples = [
    ('Who is Shaka Khan?',
     [(7, 17, 'PERSON')]),
    ('I like London and Berlin.',
     [(7, 13, 'LOC'), (18, 24, 'LOC')])
]

def evaluate(ner_model, examples):
    scorer = Scorer()
    scored_examples = []
    for input_, annot in examples:
        pred = ner_model(input_)  # Doc carrying the model's predictions
        # gold character offsets must be passed under the 'entities' key;
        # keys whose value is None are ignored by Example.from_dict
        gold_example = Example.from_dict(pred, {'entities': annot})
        scored_examples.append(gold_example)
    return scorer.score(scored_examples)

ner_model = spacy.load('en_core_web_sm')  # or the path to your own trained model
results = evaluate(ner_model, examples)
print(results)

Breaking changes occurred because classes such as GoldParse are no longer available in v3.

I believe the part of the answer about metrics is still valid

Amygdaloid answered 12/7, 2021 at 10:47 Comment(0)

Note that spaCy v3 provides an evaluate command (python -m spacy evaluate) that you can run from the command line instead of writing custom evaluation code.

Rotogravure answered 12/7, 2021 at 10:54 Comment(0)

This is how I calculated accuracy for my custom spaCy NER model:

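def flat_accuracy(text, annotations):
    # Third element of each annotation; it is compared with ent.text below, so it
    # must hold the entity text. For (start, end, label) tuples, compare with
    # ent.label_ instead.
    actual_ents = [ents[2] for ents in annotations]
    prediction = nlp_ner(text)  # nlp_ner is the loaded custom NER pipeline
    pred_ents = [ent.text for ent in prediction.ents]
    # 1 only if every entity in the text matches, in order
    return 1 if actual_ents == pred_ents else 0


# percentage of examples whose entities were all predicted correctly
predict_points = sum(flat_accuracy(test_text[0], test_text[1]) for test_text in examples)
output = (predict_points / len(examples)) * 100
print(output)  # gave 82% on my data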

Dissimulate answered 10/7, 2022 at 16:56 Comment(0)

I searched the internet for solutions but could not find one that worked. Now that I have figured out the root of the problem, I am sharing my code, which is similar to that in the original question. I hope someone still finds it useful. It works with spaCy v3.3.

from spacy.scorer import Scorer
from spacy.training import Example

def evaluate(ner_model, samples):
    scorer = Scorer(ner_model)
    example = []
    for sample in samples:
        pred = ner_model(sample['text'])
        print(pred, sample['entities'])
        temp_ex = Example.from_dict(pred, {'entities': sample['entities']})
        example.append(temp_ex)
    scores = scorer.score(example)
    
    return scores

Note: samples should be a list of dicts in the spaCy v3 training format, each with a 'text' key and an 'entities' key, like the one below:

{'text': '#Causes - Quinsy - CA0K.1\nPeri Tonsillar Abscess is usually a complication of an untreated or partially treated acute tonsillitis. The infection, in these cases, spreads to the peritonsillar area (peritonsillitis). This region comprises loose connective tissue and is hence susceptible to formation of abscess.', 'entities': [(10, 16, 'Disease_E'), (26, 48, 'Disease_E'), (112, 129, 'Complication_E'), (177, 213, 'Anatomy_E'), (237, 260, 'Anatomy_E'), (302, 309, 'Disease_E')]}
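
Before scoring, it can help to confirm that the character offsets in each sample point at the intended text, since misaligned spans silently lower the scores. The sketch below is one way to do that in plain Python; entity_surfaces is a hypothetical helper, not part of spaCy, and the two samples reuse the sentences from the accepted answer.

```python
def entity_surfaces(sample):
    """Return (surface text, label) per entity, checking that offsets are sane."""
    text, out, prev_end = sample["text"], [], 0
    for start, end, label in sorted(sample["entities"]):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        if start < prev_end:
            raise ValueError(f"span ({start}, {end}) overlaps a previous span")
        out.append((text[start:end], label))
        prev_end = end
    return out


samples = [
    {"text": "Who is Shaka Khan?", "entities": [(7, 17, "PERSON")]},
    {"text": "I like London and Berlin.",
     "entities": [(7, 13, "LOC"), (18, 24, "LOC")]},
]

surfaces = [entity_surfaces(s) for s in samples]
for sample, found in zip(samples, surfaces):
    print(sample["text"], "->", found)
```

If a printed surface string has a stray space or a clipped character, the offsets need fixing before the sample is passed to evaluate.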

Hart answered 20/1, 2023 at 0:49 Comment(0)
