I want to train a blank model for NER with my own entities. To do this, I need to use a dataset, which is currently in .csv form and features entity tags in the following format (I'll provide one example row for each relevant column):
Column: sentence
Value: I want apples
Column: data
Value: ['want;@command;2;6','apples;@fruit;7;13']
Column: entity
Value: I @command @fruit
Column: entity_types
Value: @bot/@command;@bot/@food/@fruit
To train spaCy's NER, I need the training data in the following form (a Python list of (text, annotations) tuples rather than JSON proper):
TRAIN_DATA = [
    ('Who is Shaka Khan?', {
        'entities': [(7, 17, 'PERSON')]
    }),
    ('I like London and Berlin.', {
        'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
    })
]
Link to the relevant part in the SpaCy Docs
I've tried to find a way to re-format the data from the CSV into the format spaCy requires, but so far without success. The dataset contains all the necessary information (text string, entity names, entity types, entity offsets), but I simply don't know how to get it into the correct form.
I would appreciate any and all help concerning how I would accomplish this!
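For reference, here is a minimal sketch of the conversion I have in mind, using only the standard library. It assumes that each cell in the data column is a valid Python list literal such as ['want;@command;2;6', 'apples;@fruit;7;13'], that each tag has the form text;label;start;end, and that the offsets are character positions with an exclusive end. The function names parse_tags and csv_to_train_data are hypothetical, and stripping the '@' and upper-casing the label is just one possible naming choice, not something spaCy requires.

```python
import ast
import csv
import io


def parse_tags(raw):
    """Turn the string form of a tag list into (start, end, label) tuples."""
    entities = []
    for tag in ast.literal_eval(raw):
        # Split from the right so a ';' inside the entity text cannot shift the fields.
        _text, label, start, end = tag.rsplit(";", 3)
        # Assumption: drop the leading '@' and upper-case the label.
        entities.append((int(start), int(end), label.lstrip("@").upper()))
    return entities


def csv_to_train_data(handle):
    """Read rows with 'sentence' and 'data' columns into the TRAIN_DATA layout."""
    train_data = []
    for row in csv.DictReader(handle):
        train_data.append((row["sentence"], {"entities": parse_tags(row["data"])}))
    return train_data


# In-memory stand-in for the real file; replace with open("file.csv", newline="").
sample = io.StringIO(
    'sentence,data\n'
    '"I want apples","[\'want;@command;2;6\', \'apples;@fruit;7;13\']"\n'
)
TRAIN_DATA = csv_to_train_data(sample)
print(TRAIN_DATA)
```

If a file spreads the tags over several columns instead of one list, parse_tags could be called once per non-empty cell and the resulting lists concatenated before building the tuple.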
sentence column and 10 data columns (because some sentences include up to 10 different entities). – Ventricle