I am using a model from Hugging Face, specifically Davlan/distilbert-base-multilingual-cased-ner-hrl. However, I am not able to extract full entity names from the result.
If I run the following code:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")
model = AutoModelForTokenClassification.from_pretrained("Davlan/distilbert-base-multilingual-cased-ner-hrl")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "My name is Johnathan Smith and I work at Apple"
ner_results = nlp(example, aggregation_strategy="max")
print(ner_results)
Then I get the following output:
[{'entity': 'B-PER', 'score': 0.9998949, 'index': 4, 'word': 'Johna', 'start': 11, 'end': 16}, {'entity': 'I-PER', 'score': 0.999726, 'index': 5, 'word': '##tha', 'start': 16, 'end': 19}, {'entity': 'I-PER', 'score': 0.9997751, 'index': 6, 'word': '##n', 'start': 19, 'end': 20}, {'entity': 'I-PER', 'score': 0.99974835, 'index': 7, 'word': 'Smith', 'start': 21, 'end': 26}, {'entity': 'B-ORG', 'score': 0.99870986, 'index': 12, 'word': 'Apple', 'start': 41, 'end': 46}]
It looks like I might be able to post-process this so that "Johnathan Smith" comes back as a single entity. Ideally, though, I would like the pipeline to do this for me, so that no partial words are returned.
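For reference, the post-processing I have in mind would look roughly like the sketch below. It is only a workaround under a few assumptions: the helper name `merge_word_pieces` is my own (it is not part of transformers), it relies on the `index`, `start` and `end` fields shown in the output above, and taking the minimum score for the merged entity is an arbitrary choice.

```python
def merge_word_pieces(tokens, text):
    """Merge B-/I- tagged word pieces into whole entities.

    Hypothetical helper, not a transformers API. It assumes each token dict
    has the 'entity', 'score', 'index', 'start' and 'end' keys shown above.
    """
    merged = []
    for tok in tokens:
        prefix, _, label = tok["entity"].partition("-")
        last = merged[-1] if merged else None
        # Extend the previous entity only for an I- tag with the same label
        # on the directly following token position.
        if (
            prefix == "I"
            and last is not None
            and last["entity_group"] == label
            and tok["index"] == last["last_index"] + 1
        ):
            last["end"] = tok["end"]
            last["last_index"] = tok["index"]
            last["scores"].append(tok["score"])
        else:
            merged.append({
                "entity_group": label,
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
                "scores": [tok["score"]],
            })
    # Recover the surface text from the character offsets rather than by
    # stitching '##' pieces together.
    return [
        {
            "entity_group": e["entity_group"],
            "word": text[e["start"]:e["end"]],
            "score": min(e["scores"]),  # assumption: report the weakest piece
            "start": e["start"],
            "end": e["end"],
        }
        for e in merged
    ]


example = "My name is Johnathan Smith and I work at Apple"
tokens = [
    {"entity": "B-PER", "score": 0.9998949, "index": 4, "word": "Johna", "start": 11, "end": 16},
    {"entity": "I-PER", "score": 0.999726, "index": 5, "word": "##tha", "start": 16, "end": 19},
    {"entity": "I-PER", "score": 0.9997751, "index": 6, "word": "##n", "start": 19, "end": 20},
    {"entity": "I-PER", "score": 0.99974835, "index": 7, "word": "Smith", "start": 21, "end": 26},
    {"entity": "B-ORG", "score": 0.99870986, "index": 12, "word": "Apple", "start": 41, "end": 46},
]
merged = merge_word_pieces(tokens, example)
print([(e["entity_group"], e["word"]) for e in merged])
```

On the output above this gives one PER entity, "Johnathan Smith", and one ORG entity, "Apple". I would still prefer a built-in way to get this rather than maintaining my own merging logic.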