Is there a way with spaCy's NER to calculate metrics per entity type?

Is there a way in spaCy's NER model to extract the metrics (precision, recall, F1 score) per entity type?

Something that will look like this:

         precision    recall  f1-score   support

  B-LOC      0.810     0.784     0.797      1084
  I-LOC      0.690     0.637     0.662       325
 B-MISC      0.731     0.569     0.640       339
 I-MISC      0.699     0.589     0.639       557
  B-ORG      0.807     0.832     0.820      1400
  I-ORG      0.852     0.786     0.818      1104
  B-PER      0.850     0.884     0.867       735
  I-PER      0.893     0.943     0.917       634

avg / total 0.809 0.787 0.796 6178

taken from: http://www.davidsbatista.net/blog/2018/05/09/Named_Entity_Evaluation/

Thank you!

Pollen answered 17/10, 2018 at 13:26 Comment(0)

Nice question.

First, we should clarify that spaCy uses the BILUO annotation scheme instead of the BIO scheme you are referring to. From the spaCy documentation, the letters denote the following (see the short example after the list):

  • B: The first token of a multi-token entity.
  • I: An inner token of a multi-token entity.
  • L: The final token of a multi-token entity.
  • U: A single-token entity.
  • O: A non-entity token.
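
For illustration, here is a small sketch of how a character-offset annotation maps onto these tags. The function name offsets_to_biluo_tags is the spaCy v3 spelling; in v2 the same helper was spacy.gold.biluo_tags_from_offsets.

# Rough sketch: convert a character-offset entity annotation to BILUO tags
# (spaCy v3 naming assumed; in v2 use spacy.gold.biluo_tags_from_offsets).
import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
doc = nlp.make_doc("Alexander Zverev reaches ATP Finals semis")
print(offsets_to_biluo_tags(doc, [(0, 16, "PERSON")]))
# ['B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O']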

Then, some definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 * precision * recall / (precision + recall), all computed over the predicted entities versus the gold entities.

spaCy has a built-in class to evaluate NER, called Scorer. Scorer uses exact matching to evaluate NER: the precision score is returned as ents_p, the recall as ents_r and the F1 score as ents_f.

The only problem is that it returns the scores for all entity types in the document together. However, we can filter the gold annotations down to the single type we want and call it once per type to get the desired per-type result.

All together, the code should look like this:

import spacy
from spacy.gold import GoldParse   # spaCy v2.x API
from spacy.scorer import Scorer

def evaluate(nlp, examples, ent='PERSON'):
    scorer = Scorer()
    for input_, annot in examples:
        # keep only the gold annotations of the entity type we want to score
        text_entities = []
        for entity in annot.get('entities'):
            if ent in entity:
                text_entities.append(entity)
        # gold parse built on the raw tokenization, predictions from the full pipeline
        doc_gold_text = nlp.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=text_entities)
        pred_value = nlp(input_)
        scorer.score(pred_value, gold)
    return scorer.scores


examples = [
    ("Trump says he's answered Mueller's Russia inquiry questions \u2013 live",{"entities":[[0,5,"PERSON"],[25,32,"PERSON"],[35,41,"GPE"]]}),
    ("Alexander Zverev reaches ATP Finals semis then reminds Lendl who is boss",{"entities":[[0,16,"PERSON"],[55,60,"PERSON"]]}),
    ("Britain's worst landlord to take nine years to pay off string of fines",{"entities":[[0,7,"GPE"]]}),
    ("Tom Watson: people's vote more likely given weakness of May's position",{"entities":[[0,10,"PERSON"],[56,59,"PERSON"]]}),
]

nlp = spacy.load('en_core_web_sm')
results = evaluate(nlp, examples)
print(results)

Call the evaluate function with the proper ent parameter to get the results for each tag.
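
For example, a minimal sketch that reuses the evaluate function and the examples list above and collects scores for every label found in the gold data might look like this:

# Hypothetical helper: score each label separately with evaluate() defined above.
labels = {ent[2] for _, annot in examples for ent in annot['entities']}
per_type = {label: evaluate(nlp, examples, ent=label) for label in labels}

for label, scores in per_type.items():
    print(label, scores['ents_p'], scores['ents_r'], scores['ents_f'])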

Hope it helps :)

Britten answered 16/11, 2018 at 21:57 Comment(2)
I think there is something wrong with this: when I run the evaluation on all entities, I get very good results (>90% for P, R and F), but when I filter the entities using your function, R remains high while P and F drop extremely low (below 20%). So I'm thinking the function is evaluating incorrectly at the GoldParse() line. Maybe it's taking into account all entities in the first parameter? – Ifni
@Britten I quote you: "Scorer uses exact matching to evaluate NER" – how do you know this? You would help me by letting me know your source :) Thanks – Stirring

From spaCy v3:

# Test the model (spaCy v3)
import spacy
from spacy.training.example import Example

nlp = spacy.load("./model_saved")

data = [("Taj mahal is in Agra.",
         {"entities": [(0, 9, 'name'), (16, 20, 'place')]})]

# build Example objects pairing the model's tokenization with the gold annotations
examples = []
for text, annots in data:
    doc = nlp.make_doc(text)
    examples.append(Example.from_dict(doc, annots))

print(nlp.evaluate(examples))  # This will provide overall and per entity metrics
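
As a small follow-up sketch (same nlp and examples as above), you can pull out just the per-type block from the returned dict:

# nlp.evaluate() returns a dict; 'ents_per_type' maps each label to its p/r/f scores
scores = nlp.evaluate(examples)
for label, metrics in scores["ents_per_type"].items():
    print(label, metrics["p"], metrics["r"], metrics["f"])
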
Planimetry answered 3/6, 2021 at 14:39 Comment(0)

I have been working on this, and it is now integrated within spaCy by this pull request.

Now you just need to call Scorer().scores and it will return the usual dict with an additional key, ents_per_type, which contains the precision, recall and F1 score for each entity type.
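
For instance, a minimal sketch using the spaCy v2-style GoldParse/Scorer API from the first answer (no per-type filtering needed any more; evaluate_all and the sample data are illustrative names, not part of spaCy):

import spacy
from spacy.gold import GoldParse   # spaCy v2.x API
from spacy.scorer import Scorer

def evaluate_all(nlp, examples):
    scorer = Scorer()
    for text, annot in examples:
        gold = GoldParse(nlp.make_doc(text), entities=annot['entities'])
        scorer.score(nlp(text), gold)
    return scorer.scores

examples = [
    ("Tom Watson lives in Britain.",
     {"entities": [(0, 10, "PERSON"), (20, 27, "GPE")]}),
]

nlp = spacy.load('en_core_web_sm')
scores = evaluate_all(nlp, examples)
print(scores['ents_per_type'])   # per-label precision, recall and F1 (added by the pull request)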

Hope it helps!

Sound answered 9/7, 2019 at 19:4 Comment(2)
@Britten can you have a look at my question here: #58376713 – Adham

@Britten's answer is not right; the first comment gives the idea why. You should also filter the entities of the prediction,

pred_value = nlp(input_)

I did it like this:

pred_value.ents = [e for e in pred_value.ents if e.label_ == ent]
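
Putting the pieces together, a rough sketch of the corrected function (based on the spaCy v2 code from the first answer, with both the gold and the predicted entities filtered to the requested type) could look like this:

import spacy
from spacy.gold import GoldParse   # spaCy v2.x API
from spacy.scorer import Scorer

def evaluate(nlp, examples, ent='PERSON'):
    scorer = Scorer()
    for input_, annot in examples:
        # keep only the gold annotations of the requested type
        text_entities = [e for e in annot.get('entities') if e[2] == ent]
        gold = GoldParse(nlp.make_doc(input_), entities=text_entities)
        # keep only the predicted entities of the requested type
        pred_value = nlp(input_)
        pred_value.ents = [e for e in pred_value.ents if e.label_ == ent]
        scorer.score(pred_value, gold)
    return scorer.scores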
Dottiedottle answered 23/4, 2019 at 16:52 Comment(3)
@Ruwan, I did not get your point exactly; where should I add this? – Adham
Use @Britten's answer as a starting point. You should add my second line right after my first one (find it in the first answer). Use similar filtering for text_entities before gold = GoldParse(doc_gold_text, entities=text_entities). – Dottiedottle
Can you rewrite the code chunk in your answer as it is confusing? – Unfeeling

Evaluating the NER model and calculating the metrics per entity type can be done with spaCy's built-in Scorer class.

It outputs a dict with the key ents_per_type containing precision, recall and F1 score for each entity type.

For spaCy v3, with annotated data in the following format (extracted from UBIAI):

valid_data = [
    {
        "documentName": "file.txt",
        "document": "Ram lives in Kathmandu, Nepal.",
        "annotation": [
            {
                "start": 0,
                "end": 3,
                "label": "PER",
                "text": "Ram",
                "propertiesList": [],
                "commentsList": []
            },
            ...
            ],
        "user_input": ""
    }]
import spacy
from spacy.training import Example
from spacy.scorer import Scorer

nlp = spacy.load("path_to_your_model")
scorer = Scorer()

examples = []

for content in valid_data:
    # run the model on the raw text to get the predicted entities
    predicted = nlp(content['document'])

    # collect the gold annotations for this document only
    annots = []
    for annotate_content in content['annotation']:
        start = annotate_content['start']
        end = annotate_content['end']
        label = annotate_content['label']
        annots.append((start, end, label))

    # pair prediction and gold reference in an Example object
    example = Example.from_dict(predicted, {'entities': annots})
    examples.append(example)

scores = scorer.score(examples)
print(scores['ents_per_type'])

Also, you can simply create a test dataset (in the binary .spacy format) from your annotated samples and run the following command to calculate the metrics per entity type:

!spacy evaluate path_to_your_model path_to_your_test_data
!spacy evaluate ./output/model-best ./data/test.spacy  
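
If the test data is not yet in the binary format, a rough sketch for converting it with DocBin follows; it assumes the valid_data structure shown above, and the path ./data/test.spacy is just an example.

import spacy
from spacy.tokens import DocBin

nlp = spacy.load("path_to_your_model")
db = DocBin()

for content in valid_data:
    doc = nlp.make_doc(content['document'])
    spans = []
    for a in content['annotation']:
        # char_span returns None if the offsets don't align with token boundaries
        span = doc.char_span(a['start'], a['end'], label=a['label'])
        if span is not None:
            spans.append(span)
    doc.ents = spans
    db.add(doc)

db.to_disk("./data/test.spacy")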
Alben answered 6/10, 2022 at 17:16 Comment(0)

Yes, the results come out like this:

{'token_acc': 1.0, 'token_p': 1.0, 'token_r': 1.0, 'token_f': 1.0, 'ents_p': 0.8571428571428571, 'ents_r': 0.5454545454545454, 'ents_f': 0.6666666666666665, 'ents_per_type': {'KEY5': {'p': 0.8571428571428571, 'r': 0.5454545454545454, 'f': 0.6666666666666665}}, 'speed': 34577.72779537917}
Pallbearer answered 21/9, 2021 at 10:58 Comment(1)
This looks like a nice output but it does not explain anything about how to produce it. – Grouchy
