I have a manually labeled data set of 3000 examples, divided into a train set and a test set. I have trained an NER model using spaCy to extract 8 custom entities such as "ACTION", "HIRE-DATE", "STATUS", etc. To evaluate the model I am using the spaCy Scorer.
There is no accuracy metric in the output, and I am not sure which metric I should use to decide whether the model's performance is good or bad.
There are a couple of cases where precision is low but recall is 100 and F1 is also low, e.g.:
'LOCATION': {'p': 7.142857142857142, 'r': 100.0, 'f': 13.333333333333334},
In the above case, what should our conclusion be?
Following is the full result of the Scorer, where p = precision, r = recall and f = F1 score. It contains the overall performance and the per-entity performance.
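For reference, here is a minimal sketch of how I understand the three numbers to be related, assuming F1 is the harmonic mean of precision and recall. The counts used (1 true positive, 13 false positives, 0 false negatives) are hypothetical: they are just one combination that reproduces the LOCATION row, not counts taken from my data.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) as percentages from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    return 100 * precision, 100 * recall, 100 * f1


# Hypothetical counts: 1 correct LOCATION, 13 spurious ones, none missed.
p, r, f = precision_recall_f1(tp=1, fp=13, fn=0)
print(round(p, 2), round(r, 2), round(f, 2))  # prints: 7.14 100.0 13.33
```

If that reading is right, the model finds every LOCATION in the test set, but most of the spans it labels as LOCATION are wrong.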
{
  'uas': 0.0,
  'las': 0.0,
  'ents_p': 86.40850417615793,
  'ents_r': 97.93459552495698,
  'ents_f': 91.81121419927389,
  'ents_per_type': {
    'ACTION': {'p': 97.17682020802377, 'r': 97.61194029850746, 'f': 97.3938942665674},
    'STATUS': {'p': 83.33333333333334, 'r': 96.3855421686747, 'f': 89.3854748603352},
    'PED': {'p': 98.61751152073732, 'r': 99.53488372093024, 'f': 99.07407407407408},
    'TERM-DATE': {'p': 83.52272727272727, 'r': 98.65771812080537, 'f': 90.46153846153847},
    'LOCATION': {'p': 7.142857142857142, 'r': 100.0, 'f': 13.333333333333334},
    'DOB': {'p': 10.0, 'r': 100.0, 'f': 18.181818181818183},
    'RE-HIRE-DATE': {'p': 34.84848484848485, 'r': 100.0, 'f': 51.685393258426956},
    'HIRE-DATE': {'p': 18.96551724137931, 'r': 100.0, 'f': 31.88405797101449},
    'PED-CED': {'p': 100.0, 'r': 71.42857142857143, 'f': 83.33333333333333},
    'CED': {'p': 100.0, 'r': 100.0, 'f': 100.0}
  },
  'tags_acc': 0.0,
  'token_acc': 100.0
}
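To make the per-entity numbers easier to compare, here is a small sketch that ranks the entity types by F1 and lists the ones with low precision. The values are copied from the 'ents_per_type' block above; the cutoff of 50 is an arbitrary threshold I picked for illustration, not something spaCy defines.

```python
# Per-entity scores copied from the Scorer output above.
ents_per_type = {
    "ACTION": {"p": 97.17682020802377, "r": 97.61194029850746, "f": 97.3938942665674},
    "STATUS": {"p": 83.33333333333334, "r": 96.3855421686747, "f": 89.3854748603352},
    "PED": {"p": 98.61751152073732, "r": 99.53488372093024, "f": 99.07407407407408},
    "TERM-DATE": {"p": 83.52272727272727, "r": 98.65771812080537, "f": 90.46153846153847},
    "LOCATION": {"p": 7.142857142857142, "r": 100.0, "f": 13.333333333333334},
    "DOB": {"p": 10.0, "r": 100.0, "f": 18.181818181818183},
    "RE-HIRE-DATE": {"p": 34.84848484848485, "r": 100.0, "f": 51.685393258426956},
    "HIRE-DATE": {"p": 18.96551724137931, "r": 100.0, "f": 31.88405797101449},
    "PED-CED": {"p": 100.0, "r": 71.42857142857143, "f": 83.33333333333333},
    "CED": {"p": 100.0, "r": 100.0, "f": 100.0},
}

# Arbitrary cutoff chosen for illustration only.
PRECISION_THRESHOLD = 50.0

# Rank entity types from worst to best F1.
ranked = sorted(ents_per_type.items(), key=lambda item: item[1]["f"])

# Entity types whose precision falls below the cutoff, in the same order.
low_precision = [label for label, s in ranked if s["p"] < PRECISION_THRESHOLD]

for label, s in ranked:
    print(f"{label:<13} p={s['p']:6.2f}  r={s['r']:6.2f}  f={s['f']:6.2f}")
print("Low precision:", low_precision)
```

With this cutoff, the low-precision types are LOCATION, DOB, HIRE-DATE and RE-HIRE-DATE, all of which have recall 100.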
Kindly suggest.