I am afraid you will have to write your own conversion, because the IOB encoding depends on the tokenization that the pre-trained representation model (BERT, RoBERTa, or whatever pre-trained model of your choice) uses.
The spaCy format specifies the character span of the entity, i.e.

"Who is Shaka Khan?"[7:17]

returns "Shaka Khan". You need to match that span to the tokens used by the pre-trained model.
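If your data is in the usual spaCy v2-style training format, an example looks roughly like the sketch below (the variable names are only for illustration); the character offsets are exactly what you need to align with the model's tokens:

text = "Who is Shaka Khan?"
annotations = {"entities": [(7, 17, "PERSON")]}  # (start_char, end_char, label)
start, end, label = annotations["entities"][0]
assert text[start:end] == "Shaka Khan"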
Here are examples of how different models tokenize the example sentence when you use Hugging Face's Transformers (the snippet after the list shows how to reproduce them).
- BERT: ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']
- RoBERTa: ['Who', '_is', '_Sh', 'aka', '_Khan', '?']
- XLNet: ['▁Who', '▁is', '▁Shak', 'a', '▁Khan', '?']
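A minimal sketch of how to get such tokenizations from the transformers library; the checkpoint names here (bert-base-cased, roberta-base, xlnet-base-cased) are just the standard pre-trained models, and the exact subword splits depend on the checkpoint you actually use:

from transformers import AutoTokenizer

for checkpoint in ("bert-base-cased", "roberta-base", "xlnet-base-cased"):
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    print(checkpoint, tokenizer.tokenize("Who is Shaka Khan?"))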
Once you know how the tokenizer works, you can implement the conversion. Something like this can work for BERT tokenization:
entities = [(7, 17, "PERSON")]
tokenized = ['Who', 'is', 'S', '##hak', '##a', 'Khan', '?']

cur_start = 0
state = "O"  # Outside of an entity
tags = []
for token in tokenized:
    # Deal with BERT's way of encoding spaces: "##" marks a continuation of
    # the previous word, everything else starts a new, space-separated word.
    if token.startswith("##"):
        token = token[2:]
    else:
        token = " " + token

    cur_end = cur_start + len(token)
    # Does the next entity start within this token's (approximate) span?
    if state == "O" and entities and cur_start <= entities[0][0] < cur_end:
        tags.append("B-" + entities[0][2])
        state = "I-" + entities[0][2]
    # Does the currently open entity end within this token's span?
    elif state.startswith("I-") and cur_start <= entities[0][1] < cur_end:
        tags.append(state)
        state = "O"
        entities.pop(0)
    else:
        tags.append(state)
    cur_start = cur_end
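For the example sentence, this should produce the following alignment of BERT tokens and IOB tags:

print(list(zip(tokenized, tags)))
# [('Who', 'O'), ('is', 'O'), ('S', 'B-PERSON'), ('##hak', 'I-PERSON'),
#  ('##a', 'I-PERSON'), ('Khan', 'I-PERSON'), ('?', 'O')]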
Note that the snippet would break if a single BERT token contained the end of one entity and the start of the following one (or an entire single-token entity). The tokenizer also does not preserve how many spaces (or other whitespace characters) there were in the original string, which is a potential source of errors as well.