How to reconstruct text entities with Hugging Face's transformers pipelines without IOB tags?

I've been looking to use Hugging Face's pipelines for NER (named entity recognition). However, it returns the entity labels in inside-outside-beginning (IOB) format, applied to sub-tokens rather than to spans of the original text, so I'm not able to map the output of the pipeline back to my original text. Moreover, the outputs are in BERT's WordPiece sub-token format (the default model is BERT-large).

For example:

from transformers import pipeline
nlp_bert_lg = pipeline('ner')
print(nlp_bert_lg('Hugging Face is a French company based in New York.'))

The output is:

[{'word': 'Hu', 'score': 0.9968873858451843, 'entity': 'I-ORG'},
{'word': '##gging', 'score': 0.9329522848129272, 'entity': 'I-ORG'},
{'word': 'Face', 'score': 0.9781811237335205, 'entity': 'I-ORG'},
{'word': 'French', 'score': 0.9981815814971924, 'entity': 'I-MISC'},
{'word': 'New', 'score': 0.9987512826919556, 'entity': 'I-LOC'},
{'word': 'York', 'score': 0.9976728558540344, 'entity': 'I-LOC'}]

As you can see, New York is broken up into two tags.

How can I map Hugging Face's NER Pipeline back to my original text?

Transformers version: 2.7

Dahl answered 30/3, 2020 at 18:58
Can you please provide a full minimal reproducible example, including how you loaded your nlp_bert_lg model? – Adventure
Added @Adventure – Dahl

EDIT 12/2023: As pointed out in the comments, the grouped_entities parameter has been deprecated. The correct way is to use the aggregation_strategy parameter, as noted in the source code. For instance:

from transformers import pipeline
import pandas as pd

text = 'Hugging Face is a French company based in New York.'
tagger = pipeline(task='ner', aggregation_strategy='simple')
named_ents = tagger(text)
pd.DataFrame(named_ents)

named_ents contains the following (pd.DataFrame just renders it as a table):

[
   {
      "entity_group":"ORG",
      "score":0.96934015,
      "word":"Hugging Face",
      "start":0,
      "end":12
   },
   {
      "entity_group":"MISC",
      "score":0.9981816,
      "word":"French",
      "start":18,
      "end":24
   },
   {
      "entity_group":"LOC",
      "score":0.9982121,
      "word":"New York",
      "start":42,
      "end":50
   }
]
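
Since each aggregated entity carries start and end character offsets into the original string, mapping back to the text is just slicing. A minimal sketch, reusing the text and named_ents variables from the snippet above:

# Each entity's 'start'/'end' fields are character offsets into `text`,
# so slicing recovers the exact surface span of every entity.
for ent in named_ents:
    span = text[ent['start']:ent['end']]
    print(ent['entity_group'], repr(span), round(float(ent['score']), 3))
# ORG 'Hugging Face' 0.969
# MISC 'French' 0.998
# LOC 'New York' 0.998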

ORIGINAL ANSWER: On the 17th of May, a pull request (https://github.com/huggingface/transformers/pull/3957) implementing what you are asking for was merged, so our life is now way easier. You can use it in the pipeline like this:

ner = pipeline('ner', grouped_entities=True)

and your output will be as expected. At the moment you have to install from the master branch since there is no new release yet. You can do it via

pip install git+https://github.com/huggingface/transformers.git@48c3a70b4eaedab1dd9ad49990cfaa4d6cb8f6a0
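
For completeness, a quick usage sketch with the question's sentence (note that on current releases you would pass aggregation_strategy='simple' instead, as the edit above explains):

from transformers import pipeline

# grouped_entities=True merges the per-sub-token I-/B- tags into whole entities
ner = pipeline('ner', grouped_entities=True)
print(ner('Hugging Face is a French company based in New York.'))
# Expected shape: one dict per entity, e.g.
# [{'entity_group': 'ORG', 'word': 'Hugging Face', ...}, ...]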
Margarine answered 20/5, 2020 at 9:7
Thanks, that was super helpful. Note that the argument is now named aggregation_strategy. – Hugely
WARNING: aggregation_strategy='max' or aggregation_strategy='average' relies on softmax probabilities --> these scores are not proper confidence scores! See: stats.stackexchange.com/questions/309642/… – Tetrabasic
token_classifier = pipeline("ner", model=model, aggregation_strategy='simple', tokenizer=tokenizer, grouped_entities=True) gives me {'entity_group': 'B_name', 'score': 0.96226656, 'word': 'Pratik', 'start': 1141, 'end': 1149}, {'entity_group': 'I_name', 'score': 0.9272271, 'word': 'kr', 'start': 1150, 'end': 1157}, {'entity_group': 'L_name', 'score': 0.7290683, 'word': 'kumar', 'start': 1158, 'end': 1163}. Ideally, it should be grouped to just name? How to achieve this? – Mariomariology

If you're looking at this in 2022:

from transformers import pipeline
import pandas as pd

text = 'Hugging Face is a French company based in New York.'

tagger = pipeline(task='ner', aggregation_strategy='simple')
named_ents = tagger(text)
pd.DataFrame(named_ents)
[{'entity_group': 'ORG',
  'score': 0.96934015,
  'word': 'Hugging Face',
  'start': 0,
  'end': 12},
 {'entity_group': 'MISC',
  'score': 0.9981816,
  'word': 'French',
  'start': 18,
  'end': 24},
 {'entity_group': 'LOC',
  'score': 0.9982121,
  'word': 'New York',
  'start': 42,
  'end': 50}]
Perpetuate answered 25/3, 2022 at 14:40
WARNING: aggregation_strategy='max' or aggregation_strategy='average' relies on softmax probabilities --> these scores are not proper confidence scores! See: stats.stackexchange.com/questions/309642/… – Tetrabasic

Unfortunately, as of now (version 2.6, and I think even with 2.7), you cannot do that with the pipeline feature alone: the __call__ function invoked by the pipeline just returns a list, see the code here. This means you'd have to do a second tokenization step with an "external" tokenizer, which defeats the purpose of the pipelines altogether.

But instead, you can make use of the second example posted in the documentation, just below the sample similar to yours. For the sake of future completeness, here is the code:

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

This returns exactly what you are looking for. Note that the CoNLL annotation scheme is described as follows in its original paper:

Each line contains four fields: the word, its part-of-speech tag, its chunk tag and its named entity tag. Words tagged with O are outside of named entities and the I-XXX tag is used for words inside a named entity of type XXX. Whenever two entities of type XXX are immediately next to each other, the first word of the second entity will be tagged B-XXX in order to show that it starts another entity. The data contains entities of four types: persons (PER), organizations (ORG), locations (LOC) and miscellaneous names (MISC). This tagging scheme is the IOB scheme originally put forward by Ramshaw and Marcus (1995).

In other words, if you are unhappy with the (still split) entities, you can concatenate all subsequent I-tagged tokens, or a B- tag followed by I- tags. This scheme makes it impossible for two different, immediately neighboring entities to both be tagged with only I- tags.
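
As a sketch of that concatenation step (group_iob is my own helper, not part of transformers): glue ##-prefixed sub-tokens back onto the previous word, then group consecutive I- tags, closing the running entity on O, on a B- tag, or on a type change.

def group_iob(tagged_tokens):
    """Group (token, label) pairs like those printed above into entities."""
    entities = []
    words, ent_type = [], None
    for token, label in tagged_tokens:
        if token.startswith('##') and words:
            # WordPiece continuation: glue onto the previous word
            words[-1] += token[2:]
            continue
        tag, typ = (label.split('-', 1) if '-' in label else ('O', None))
        # Close the running entity on O, on B-, or when the type changes
        if ent_type and (tag != 'I' or typ != ent_type):
            entities.append((' '.join(words), ent_type))
            words, ent_type = [], None
        if tag in ('B', 'I'):
            if ent_type is None:
                words, ent_type = [token], typ
            else:
                words.append(token)
    if ent_type:
        entities.append((' '.join(words), ent_type))
    return entities

print(group_iob([('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'),
                 ('is', 'O'), ('French', 'I-MISC')]))
# -> [('Hugging Face', 'ORG'), ('French', 'MISC')]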

Adventure answered 1/4, 2020 at 8:41
Thanks for the answer! Do you know when it splits the words by ##? – Costume
Words are split based on the notion of "subword units", see for example this article. Essentially, it forces a smaller vocabulary, while still being able to reproduce rare words by combining different subwords, which are indicated by the ##. I.e., there might be subwords byte, -, and pair, but not byte-pair, so it would instead be represented by three "combined" tokens byte, ##-, ##pair. The full vocab is defined by the model itself, see the vocab.txt file. – Adventure
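
To see the splitting concretely, a minimal sketch (the exact split depends on the model's vocab.txt; 'Hugging' happens to be out-of-vocabulary for bert-base-cased, as the question's output shows):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
print(tokenizer.tokenize('Hugging Face'))
# ['Hu', '##gging', 'Face'] -- '##' marks a piece that continues the previous token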

As of 2024, what I found is to use the aggregation_strategy parameter of the NER pipeline (see the pipeline documentation). You can use first, average, or max, which are the operations applied to the scores of the predicted NER sub-tokens (softmax outputs).

I have provided a small example below with aggregation_strategy set to average:

# imports
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
import warnings
warnings.filterwarnings("ignore")

model_id = "dslim/bert-base-NER"  # model from the Hugging Face Hub

tokenizer_ner = AutoTokenizer.from_pretrained(model_id)
ner_model = AutoModelForTokenClassification.from_pretrained(model_id)

nlp = pipeline('ner',
               model=ner_model,
               tokenizer=tokenizer_ner,
               aggregation_strategy='average',
               device=None)

x = nlp('Hugging Face is a French company based in New York.')

print(x)

#output
#[{'entity_group': 'ORG', 'score': 0.72114974, 'word': 'Hugging Face', 'start': 0, 'end': 12}, {'entity_group': 'MISC', 'score': 0.99963593, 'word': 'French', 'start': 18, 'end': 24}, {'entity_group': 'LOC', 'score': 0.99922967, 'word': 'New York', 'start': 42, 'end': 50}]

Note: if aggregation_strategy=None, then you get the raw per-token outputs as shown in the question. My transformers version is 4.41.1.
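
To compare the strategies side by side, a small sketch reusing ner_model and tokenizer_ner from above (the strategies differ only in how the per-sub-token softmax scores are combined into one score per entity):

for strategy in ('first', 'average', 'max'):
    tagger = pipeline('ner',
                      model=ner_model,
                      tokenizer=tokenizer_ner,
                      aggregation_strategy=strategy)
    ents = tagger('Hugging Face is a French company based in New York.')
    print(strategy, [(e['entity_group'], round(float(e['score']), 3)) for e in ents])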

Dunford answered 15/6 at 19:18
