Converting Spacy Training Data format to Spacy CLI Format (for blank NER)

This is the classic training format.

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

I used to train with code, but as I understand it, training works better with the CLI train method. However, my data is in the format shown above.

I have found code snippets for this kind of conversion, but every one of them performs spacy.load('en') rather than starting from a blank model, which made me wonder: are they training an existing model rather than a blank one?

This chunk seems pretty easy:

import spacy
from spacy.gold import docs_to_json
import srsly

nlp = spacy.load('en', disable=["ner"]) # as you see it's loading 'en' which I don't have
TRAIN_DATA = #data from above

docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)

srsly.write_json("ent_train_data.json", [docs_to_json(docs)])

Running this code throws: Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

I am quite confused about how to use this with spacy train on a blank model. Should I just use spacy.blank('en')? And what about the disable=["ner"] flag?

Edit:

If I try spacy.blank('en') instead, I receive: Can't import language goal from spacy.lang: No module named 'spacy.lang.en'

Edit 2: I have tried loading en_core_web_sm

nlp = spacy.load('en_core_web_sm')

docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
    docs.append(doc)

srsly.write_json("ent_train_data.json", [docs_to_json(docs)])

TypeError: object of type 'NoneType' has no len()

Debug output for the failing example:

print(text[start:end]) prints: Ailton

print(text) prints: Goal! FK Qarabag 1, Partizani Tirana 0. Filip Ozobic - FK Qarabag - shot with the head from the centre of the box to the centre of the goal. Assist - Ailton

doc.char_span(...) returns None for that span, so the doc.ents = ... line is what raises the TypeError above.

Edit 3: From Ines' comment

from spacy.gold import biluo_tags_from_offsets  # plus the imports from before

nlp = spacy.load('en_core_web_sm')

docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    docs.append(doc)

srsly.write_json(train_name + "_spacy_format.json", [docs_to_json(docs)])

This created the JSON, but I don't see any of my tagged entities in the generated file.

Asked by Saar on 5/12/2019.

Edit 3 is close, but you're missing a step where you add the entities to the document. This should work:

import spacy
import srsly
from spacy.gold import docs_to_json, biluo_tags_from_offsets, spans_from_biluo_tags

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_sm')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    # convert the character offsets to token-aligned BILUO tags, then back into Span objects
    tags = biluo_tags_from_offsets(doc, annot['entities'])
    entities = spans_from_biluo_tags(doc, tags)
    # this is the step Edit 3 was missing: actually set the entities on the Doc
    doc.ents = entities
    docs.append(doc)

# docs_to_json produces the JSON training format expected by spaCy v2's train CLI
srsly.write_json("spacy_format.json", [docs_to_json(docs)])

It would be good to add a built-in function to do this conversion, since it's common to want to shift from the example scripts (which are just meant to be simple demos) to the train CLI.
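A minimal sketch of that next step with the v2 train CLI, assuming spaCy v2.x and a separately prepared dev file (dev_spacy_format.json is just a placeholder name); by default the v2 CLI trains a blank model of the given language unless you pass --base-model:

python -m spacy train en ./model_output spacy_format.json dev_spacy_format.json --pipeline ner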

Edit:

You can also skip the somewhat indirect use of the built-in BILUO converters and use what you had above:

    doc.ents = [doc.char_span(start_idx, end_idx, label=label) for start_idx, end_idx, label in annot["entities"]]
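A minimal sketch of that char_span variant, assuming spaCy v2, which skips offsets that char_span cannot align to token boundaries (where it returns None, as in Edit 2's "Ailton" example) instead of letting doc.ents fail:

import spacy
import srsly
from spacy.gold import docs_to_json

nlp = spacy.load('en_core_web_sm', disable=["tagger", "ner"])  # keep the parser for sentence boundaries
docs = []
for text, annot in TRAIN_DATA:  # TRAIN_DATA as defined above
    doc = nlp(text)
    spans = []
    for start_idx, end_idx, label in annot["entities"]:
        span = doc.char_span(start_idx, end_idx, label=label)
        if span is None:
            # offsets that don't line up with spaCy's token boundaries (Edit 2's "Ailton" case)
            print("Skipping misaligned entity:", repr(text[start_idx:end_idx]))
        else:
            spans.append(span)
    doc.ents = spans
    docs.append(doc)

srsly.write_json("spacy_format.json", [docs_to_json(docs)])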
Answered by Karlotte on 6/12/2019. Comments:
Saar: This is the example I ended up using, which I found on this spaCy issue. What do you think about it? Your code seems way easier to follow.
Karlotte: It's pretty similar. I wouldn't filter out sentences with no entities (they are also good as training data), and be aware that these conversions will discard entities that don't line up exactly with spaCy's token boundaries, which may not be what you want.
Saar: Thanks a lot for your input. Can you give an example of what you mean by "conversions will discard entities that don't line up exactly with spaCy's token boundaries"?
Saar: Also, I have seen people passing disable=["ner"]. Why do they use that?
Karlotte: 1) Try an entity span for the first sentence like (1, 5, "PERSON") and check what happens. (This actually crashes with doc.char_span(), so there the built-in functions are better, but they can still skip spans; see the sketch after these comments.) 2) You need the parser to set sentence boundaries, but you can disable the tagger and ner to make it faster, plus you don't want to accidentally include the automatic NER spans in the output. In the version above, doc.ents = entities overwrites any entities, but if you set them incrementally you'd have to watch out that you reset doc.ents first. Or just disable ner, which is easier.
Saar: Thank you very much for the detailed explanations!
Clepsydra: It's been almost a year since this answer was given. Can you confirm that, as of now, this is still THE recommended way to convert the spaCy "simple" training format to JSON?
Karlotte: Yes, for spaCy v2 this is still a good method. (It kind of feels like it's been way longer for everyone, but this answer is really barely six months old!)
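To illustrate the token-boundary point from these comments, a minimal sketch (assuming spaCy v2):

import spacy
from spacy.gold import biluo_tags_from_offsets

nlp = spacy.load('en_core_web_sm', disable=["tagger", "ner"])
doc = nlp("Who is Shaka Khan?")

# (1, 5) cuts through the tokens "Who" and "is", so char_span cannot build a Span
print(doc.char_span(1, 5, label="PERSON"))  # prints None

# the BILUO converter does not crash; it marks the overlapping tokens with "-",
# which means the misaligned entity is silently dropped from the converted data
print(biluo_tags_from_offsets(doc, [(1, 5, "PERSON")]))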
For spaCy v3, here is an updated version of the accepted answer:
import spacy
import srsly
from spacy.training import docs_to_json, offsets_to_biluo_tags, biluo_tags_to_spans

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]

nlp = spacy.load('en_core_web_lg')
docs = []
for text, annot in TRAIN_DATA:
    doc = nlp(text)
    tags = offsets_to_biluo_tags(doc, annot['entities'])
    entities = biluo_tags_to_spans(doc, tags)
    doc.ents = entities
    docs.append(doc)

srsly.write_json("spacy_format.json", [docs_to_json(docs)])

As of spaCy v3.1, the above code works. The relevant functions from spacy.gold have been renamed and moved to spacy.training.
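One caveat: the v3 train CLI itself no longer consumes this JSON format; you can either run spacy convert on the JSON to produce binary .spacy files, or skip JSON entirely and serialize the docs with a DocBin. A minimal sketch of the latter, assuming spaCy v3.x and the docs list built above:

from spacy.tokens import DocBin

# serialize the annotated Docs into the binary corpus format used by `spacy train` in v3
doc_bin = DocBin(docs=docs)
doc_bin.to_disk("./train.spacy")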

Answered by Closefitting on 14/7/2021.
