What is a good metric to evaluate a NER model trained in spaCy?

I have a data set of 3,000 manually labeled examples, divided into a train and a test set. I have trained a NER model using spaCy to extract 8 custom entities such as ACTION, HIRE-DATE, STATUS, etc. To evaluate the model I am using spaCy's Scorer.

There is no accuracy metric in the output, and I am not sure which metric I should use to decide whether the model's performance is good or bad.

There are a couple of cases where precision is low, recall is 100, and F1 is also low, e.g.:

'LOCATION': {'p': 7.142857142857142, 'r': 100.0, 'f': 13.333333333333334},

In a case like the above, what should our conclusion be?

Below is the full result of the Scorer, where p = precision, r = recall, and f = F1 score. It contains both the overall performance and the per-entity performance.

{
 'uas': 0.0,
 'las': 0.0,
 'ents_p': 86.40850417615793,
 'ents_r': 97.93459552495698,
 'ents_f': 91.81121419927389,
 'ents_per_type': {
   'ACTION': {'p': 97.17682020802377, 'r': 97.61194029850746, 'f': 97.3938942665674},
   'STATUS': {'p': 83.33333333333334, 'r': 96.3855421686747, 'f': 89.3854748603352},
   'PED': {'p': 98.61751152073732, 'r': 99.53488372093024, 'f': 99.07407407407408},
   'TERM-DATE': {'p': 83.52272727272727, 'r': 98.65771812080537, 'f': 90.46153846153847},
   'LOCATION': {'p': 7.142857142857142, 'r': 100.0, 'f': 13.333333333333334},
   'DOB': {'p': 10.0, 'r': 100.0, 'f': 18.181818181818183},
   'RE-HIRE-DATE': {'p': 34.84848484848485, 'r': 100.0, 'f': 51.685393258426956},
   'HIRE-DATE': {'p': 18.96551724137931, 'r': 100.0, 'f': 31.88405797101449},
   'PED-CED': {'p': 100.0, 'r': 71.42857142857143, 'f': 83.33333333333333},
   'CED': {'p': 100.0, 'r': 100.0, 'f': 100.0}},
 'tags_acc': 0.0,
 'token_acc': 100.0}
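
Roughly, the scores above are produced along these lines (a minimal sketch assuming spaCy v2's Scorer and GoldParse; evaluate_ner, examples and the model path are just illustrative names):

import spacy
from spacy.gold import GoldParse   # spaCy v2
from spacy.scorer import Scorer

def evaluate_ner(nlp, examples):
    """examples: list of (text, {"entities": [(start, end, label), ...]})"""
    scorer = Scorer()
    for text, annotations in examples:
        gold = GoldParse(nlp.make_doc(text), entities=annotations["entities"])
        pred = nlp(text)              # run the trained pipeline on the raw text
        scorer.score(pred, gold)      # accumulates p/r/f overall and per entity type
    return scorer.scores              # the dict shown above

# nlp = spacy.load("path/to/trained/model")
# print(evaluate_ner(nlp, test_examples))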

Kindly suggest.

Torrid asked 1/9, 2019 at 16:54

It depends on your application. What's worse: missing an entity, or wrongly flagging something as an entity? If failing to label an entity (a false negative) is bad, then you care about recall. If wrongly flagging a non-entity as an entity (a false positive) is bad, then you care about precision. If you care about precision and recall equally, use F_1. If you care about precision (false positives) twice as much as recall (false negatives), use F_0.5. More generally, you can use F_b for any b to express the trade-off you care about. The formula is shown and explained on the Wikipedia page for the F-score.
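
For reference, the formula from that page is F_b = (1 + b^2) * p * r / (b^2 * p + r). A quick sketch to check the numbers from the question (plain Python, nothing spaCy-specific):

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision and recall.
    beta < 1 weights precision more, beta > 1 weights recall more."""
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# The LOCATION row from the question: precision ~7.1, recall 100.
print(f_beta(7.142857142857142, 100.0))             # ~13.33, matches the reported 'f'
print(f_beta(7.142857142857142, 100.0, beta=0.5))   # ~8.77, even lower once precision is weighted more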

Edit: answering the direct question from the original post:

The system does badly on LOCATION and the three date entities; the others look good. If it were me, I would use NER to extract all dates as a single entity type, then build a separate system, rule-based or a classifier, to distinguish between the different kinds of dates (a sketch of that idea is below). For location, you could use a system that focuses specifically on geoparsing, such as Mordecai.
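
A rough sketch of what that rule-based second stage might look like (entirely hypothetical: DATE_KEYWORDS, classify_date and the keyword lists are made up for illustration and would need tuning on real data):

# Hypothetical post-processing: the NER model tags every date as a generic
# DATE entity, and nearby keywords decide which specific kind of date it is.
DATE_KEYWORDS = {
    "HIRE-DATE": ("hired", "hire date", "joined", "start date"),
    "TERM-DATE": ("terminated", "termination", "last day"),
    "DOB": ("born", "birth", "date of birth"),
}

def classify_date(text, start, end, window=40):
    """Look at a character window around the date span and return the first
    label whose keyword appears in it; fall back to a generic DATE."""
    context = text[max(0, start - window): end + window].lower()
    for label, keywords in DATE_KEYWORDS.items():
        if any(kw in context for kw in keywords):
            return label
    return "DATE"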

Phillipp answered 8/9, 2019 at 22:57
Thanks Sam for the explanation. In my case it is OK if the model fails to label an entity, so I should not care much about recall. But I am confused about the second case, where the model wrongly flags a non-entity as an entity. There are cases where the model wrongly labels an entity, for example a date of birth gets labeled as HIRE-DATE; does this fall under the second category (false positive)? I am a bit confused here, can you please elaborate? Frankly, I find the precision/recall concepts a bit hard to understand :) – Torrid
Yes, what you describe is a false positive, so it lowered its precision. I see the precision of HIRE-DATE is 18%, which is very low. – Phillipp
Precision and recall are hard for everyone at first. But false negatives and positives are easier to understand, and if you think "fewer false positives -> higher precision, fewer false negatives -> higher recall" that is usually good enough. – Phillipp
@Torrid - Did this answer your question? If so, consider accepting this answer. If not, please comment again to let me know what is missing. – Phillipp
Thanks Sam, this is really helping my understanding. One more thing I would like to clarify, which I implemented after posting this question: I computed the model's confidence using spaCy's nlp2.entity.beam_parse. What I observed is that in a few cases, even when the entity is correctly labeled, the confidence is very low, say 43%. When I investigated, I found that there are very few occurrences of those particular entities in the training data. Is this the correct conclusion? What other factors make the model less confident? – Torrid
I think your interpretation is right. In the CoNLL-2003 dataset, which most NER systems use, there are ~5000 examples of each entity type. So if you have fewer than that, you probably want more labelled examples. – Phillipp
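
The beam-parse confidence recipe mentioned in the comments above usually looks something like this in spaCy v2 (a sketch; texts is a placeholder list of strings, and beam_width/beam_density are illustrative settings):

from collections import defaultdict

# Assumes spaCy v2, where the trained NER component is exposed as nlp.entity.
docs = list(nlp.pipe(texts, disable=["ner"]))
beams = nlp.entity.beam_parse(docs, beam_width=16, beam_density=0.0001)

for doc, beam in zip(docs, beams):
    entity_scores = defaultdict(float)
    for score, ents in nlp.entity.moves.get_beam_parses(beam):
        for start, end, label in ents:
            entity_scores[(start, end, label)] += score   # sum of beam probabilities
    for (start, end, label), confidence in entity_scores.items():
        print(doc[start:end], label, round(confidence, 3))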
How do you add metrics to spacy train commands? – Birkner
