Convert NER SpaCy format to IOB format
I have data which is already labelled in SpaCy format. For example:

("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]})

But I want to try training it with some other NER model, such as BERT-NER, which requires IOB tagging instead. Is there any conversion code from the SpaCy data format to IOB?

Thanks!

Mime answered 14/1, 2020 at 9:35 Comment(1)
Hi, I have recently been working with SpaCy NER tagging. I have a dataset which is already labelled in SpaCy format. Now I want to use a SpaCy transformer (RoBERTa) for the NER task. Do I need to convert the SpaCy NER tag file to IOB format? — Spy

This is closely related to, and mostly copied from, https://mcmap.net/q/1162936/-converting-spacy-training-data-format-to-spacy-cli-format-for-blank-ner; see the notes in the comments there, too:

import spacy
from spacy.gold import biluo_tags_from_offsets  # spaCy v2; in v3 use spacy.training.offsets_to_biluo_tags

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    # then convert L->I and U->B to have IOB tags for the tokens in the doc
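
A minimal sketch of that last step (these lines are meant to sit inside the loop above; simple string replacements suffice because BILUO differs from IOB only in its L and U tags):

    # convert BILUO to IOB: L -> I, U -> B
    tags = [t.replace("L-", "I-").replace("U-", "B-") for t in tags]
    docs.append((doc, tags))  # collect (doc, IOB tags) pairs in the docs list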
Jenicejeniece answered 14/1, 2020 at 17:34 Comment(0)

I am afraid you will have to write your own conversion, because the IOB encoding depends on the tokenization used by the pre-trained representation model (BERT, RoBERTa, or whatever pre-trained model you choose).

The SpaCy format specifies the character span of the entity, i.e.

"Who is Shaka Khan?"[7:17]

will return "Shaka Khan". You need to match that span to the tokens used by the pre-trained model.

Here are examples of how different models tokenize the example sentence when you use Hugging Face's Transformers.

  • BERT: ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']
  • RoBERTa: ['Who', '_is', '_Sh', 'aka', '_Khan', '?']
  • XLNet: ['▁Who', '▁is', '▁Shak', 'a', '▁Khan', '?']
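
If you want to reproduce these yourself, here is a minimal sketch using Hugging Face's Transformers (the checkpoint names are my assumption of the standard pretrained models; the exact subword splits can vary across tokenizer versions):

from transformers import AutoTokenizer

# tokenize the example sentence with each model's pretrained tokenizer
for name in ["bert-base-cased", "roberta-base", "xlnet-base-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize("Who is Shaka Khan?"))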

Once you know how the tokenizer works, you can implement the conversion. Something like this can work for the BERT tokenization:

entities = [(7, 17, "PERSON")]
tokenized = ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']

cur_start = 0
state = "O"  # outside any entity
tags = []
for token in tokenized:
    # Deal with BERT's way of encoding spaces:
    # "##" continues the previous word, anything else starts a new one
    if token.startswith("##"):
        token = token[2:]
    else:
        token = " " + token

    cur_end = cur_start + len(token)
    if state == "O" and entities and cur_start <= entities[0][0] < cur_end:
        # the entity starts inside this token
        tags.append("B-" + entities[0][2])
        state = "I-" + entities[0][2]
    elif state.startswith("I-") and cur_start < entities[0][1] <= cur_end:
        # the entity ends inside this token
        tags.append(state)
        state = "O"
        entities.pop(0)
    else:
        tags.append(state)
    cur_start = cur_end

Note that the snippet would break if a single BERT token contained the end of one entity and the start of the next one. The tokenizer also does not distinguish how many spaces (or other whitespace characters) there were in the original string, which is a potential source of errors as well.
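
A more robust alternative, if a fast tokenizer exists for your model, is to ask the tokenizer for the original character offsets of each token and compare those directly against the entity spans. A sketch, assuming the bert-base-cased checkpoint:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
enc = tokenizer("Who is Shaka Khan?", return_offsets_mapping=True,
                add_special_tokens=False)
# each token comes with its (start, end) character span in the original string
for token, (start, end) in zip(enc.tokens(), enc["offset_mapping"]):
    print(token, start, end)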

Naturalistic answered 14/1, 2020 at 12:0 Comment(0)

First you need to convert your annotated JSON file to CSV.
Then you can run the code below to convert it into the spaCy v2 training format:

import pandas as pd

# assumes a CSV with a text column 'ner' and a label column 'label',
# where the whole text of each row is a single entity
df = pd.read_csv('SC_CSV.csv')
l1 = []
l2 = []

for i in range(0, len(df['ner'])):
    l1.append(df['ner'][i])
    l2.append({"entities": [(0, len(df['ner'][i]), df['label'][i])]})

TRAIN_DATA = list(zip(l1, l2))
TRAIN_DATA

Now TRAIN_DATA is in the spaCy v2 format.
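
For example, a hypothetical row whose ner column is "London" and whose label column is LOC would produce:

TRAIN_DATA = [("London", {"entities": [(0, 6, "LOC")]})]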

The following converts the data from the old spaCy v2 format to the new spaCy v3 format:

import pandas as pd
from tqdm import tqdm
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en") # load a new spacy model
db = DocBin() # create a DocBin object

for text, annot in tqdm(TRAIN_DATA): # data in previous format
    doc = nlp.make_doc(text) # create doc object from text
    ents = []
    for start, end, label in annot["entities"]: # add character indexes
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    doc.ents = ents # label the text with the ents
    db.add(doc)

db.to_disk("./train.spacy") # save the docbin object
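
The saved DocBin can then be passed to spaCy v3 training. A sketch, assuming config.cfg is a training config you have already generated (e.g. with python -m spacy init config) and that you have a separate dev set:

python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy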
Nofretete answered 6/8, 2021 at 4:19 Comment(0)

I have faced this kind of problem. What I did was transform the data to the spaCy binary format, then load the data back from the DocBin object using this code:

import spacy
from spacy.tokens import DocBin
db = DocBin().from_disk("your_docbin_name.spacy")
nlp = spacy.blank("language_used")
documents = list(db.get_docs(nlp.vocab))

Then this code may help you extract the IOB format from it:

for elem in documents[0]:
    if elem.ent_iob_ != "O":
        print(elem.text, elem.ent_iob_, "-", elem.ent_type_)
    else:
        print(elem.text, elem.ent_iob_)

Here is an example of my output:

عبرت O
الديناميكية B - POLITIQUE
النسوية I - POLITIQUE
التي O
تأسست O
بعد O
25 O
جويلية O
2021 O
عن O
رفضها O
القطعي O
لمشروع O
تنقيح B - POLITIQUE
المرسوم B - POLITIQUE
عدد O
88 O
لسنة O
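
If you need these tags in a file rather than printed, here is a minimal sketch that writes CoNLL-style lines (the output file name is my assumption):

with open("train_iob.txt", "w", encoding="utf-8") as f:
    for doc in documents:
        for elem in doc:
            tag = elem.ent_iob_ if elem.ent_iob_ == "O" else elem.ent_iob_ + "-" + elem.ent_type_
            f.write(elem.text + "\t" + tag + "\n")
        f.write("\n")  # blank line between documents/sentences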
Lewendal answered 9/4, 2022 at 9:47 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. — Manhood

import spacy

nlp = spacy.blank("en")

for text, labels in data:  # data: your training data in the spaCy format
    doc = nlp(text)
    ents = []

    for start, end, label in labels["entities"]:
        ents.append(doc.char_span(start, end, label=label))
    doc.ents = ents

    for tok in doc:
        label = tok.ent_iob_
        if tok.ent_iob_ != "O":
            label += "-" + tok.ent_type_
        print(tok, label, sep="\t")

If you get a NoneType error, it is because doc.char_span returns None when the character offsets do not align with token boundaries; add a try block or a None check depending on your dataset, or clean your dataset.
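
A minimal sketch of such a guard, assuming you would rather skip misaligned spans than stop:

for start, end, label in labels["entities"]:
    span = doc.char_span(start, end, label=label, alignment_mode="contract")
    if span is not None:  # None means the offsets did not align to tokens
        ents.append(span)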

Bohunk answered 12/4, 2022 at 18:58 Comment(0)
