Formatting training dataset for SpaCy NER

I want to train a blank model for NER with my own entities. To do this, I need to use a dataset, which is currently in .csv form and features entity tags in the following format (I'll provide one example row for each relevant column):


Column: sentence

Value: I want apples


Column: data

Value: ['want;@command;2;6','apples;@fruit;7;13']


Column: entity

Value: I @command @fruit


Column: entity_types

Value: @bot/@command;@bot/@food/@fruit


In order to train SpaCy's NER, I need the training data as json in the following form:

    TRAIN_DATA = [
        ('Who is Shaka Khan?', {
            'entities': [(7, 17, 'PERSON')]
        }),
        ('I like London and Berlin.', {
            'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]
        })
    ]

Link to the relevant part in the SpaCy Docs

I've tried to find a solution for how I could re-format the data from the csv to the format required by SpaCy, but I was unsuccessful as of yet. The dataset does contain all the necessary information - text string, entity names, entity types, entity offsets - but I simply don't know how to get them in the correct form.

I would appreciate any and all help concerning how I would accomplish this!

Ventricle answered 22/11, 2017 at 21:14 Comment(0)

It wasn't 100% clear from your question whether you're also asking about the CSV extraction – so I'll just assume this is not the problem. (If it is, this should be pretty easy to achieve using the csv module. If the CSV data is messy and contains a bunch of stuff combined in one string, you might have to call split on it and do it the hacky way.)
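
For the extraction step, something along these lines might work. It is only a minimal sketch: the CSV text is made up to mirror the columns described in the question, `parse_data_cell` and `load_examples` are hypothetical helper names, and the regular expression assumes every entity item looks like `text;@label;start;end` with no commas, semicolons or quotes inside the entity text.

```python
import csv
import io
import re

# Hypothetical CSV content mirroring the columns described in the question.
CSV_TEXT = '''sentence,data
I want apples,"['want;@command;2;6','apples;@fruit;7;13']"
'''

# One entity item: text;@label;start;end (assumed layout of the "data" cell).
ITEM_PATTERN = re.compile(r"([^;'\[\],]+);(@[^;']+);(\d+);(\d+)")


def parse_data_cell(cell):
    """Turn the raw "data" string into (text, label, start, end) tuples."""
    return [
        (text, label, int(start), int(end))
        for text, label, start, end in ITEM_PATTERN.findall(cell)
    ]


def load_examples(csv_file):
    """Read rows into a list of {'sentence': ..., 'data': [...]} dicts."""
    reader = csv.DictReader(csv_file)
    return [
        {'sentence': row['sentence'], 'data': parse_data_cell(row['data'])}
        for row in reader
    ]


your_extracted_data = load_examples(io.StringIO(CSV_TEXT))
print(your_extracted_data)
```

With a real file you would pass `open('your_file.csv', newline='')` (file name assumed) instead of the `io.StringIO` object.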

If you're able to extract the "sentence" and "data" column in a format like this, you're actually very close to spaCy's training format already:

[{
    'sentence': 'I want apples',
    'data': [('want', '@command', 2, 6), ('apples', '@fruit', 7, 13)]
}]

Your offsets already follow spaCy's convention: the end index is exclusive, just like a Python slice, so 'I want apples'[2:6] is 'want' and no adjustment is needed. I'm probably making this a lot more verbose than it should be, but I hope this makes it easier to follow:

TRAIN_DATA = []

for example in your_extracted_data:  # see example above
    entities = []
    for entity in example['data']:  # iterate over the entities
        text, label, start, end = entity  # ('want', '@command', 2, 6)
        label = label.split('@')[1].upper()  # not necessary, but nicer
        entities.append((start, end, label))  # end is already exclusive
    # add training example of (text, annotations) tuple
    TRAIN_DATA.append((example['sentence'], {'entities': entities}))

This should give you training data that looks like this:

[
    ('I want apples', {'entities': [(2, 6, 'COMMAND'), (7, 13, 'FRUIT')]})
]
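
A quick way to sanity-check offsets before training is to slice each sentence with them. This is a minimal sketch and not part of spaCy: `check_offsets` is a hypothetical helper, and it relies on spaCy's character offsets being end-exclusive like Python slices, so `text[start:end]` should reproduce the entity text exactly.

```python
def check_offsets(text, spans):
    """Return (start, end, expected, actual) for every span whose slice is off."""
    problems = []
    for expected, start, end in spans:
        actual = text[start:end]  # end-exclusive, like spaCy's offsets
        if actual != expected:
            problems.append((start, end, expected, actual))
    return problems


sentence = 'I want apples'
print(check_offsets(sentence, [('want', 2, 6), ('apples', 7, 13)]))  # []
print(check_offsets(sentence, [('want', 2, 5)]))  # [(2, 5, 'want', 'wan')]
```

An empty list means every span lines up with its text. Anything else shows the slice you actually get, which makes off-by-one conventions easy to spot.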
Edan answered 28/11, 2017 at 6:0 Comment(3)
Thank you very much, your answer really is helping me out and is exactly what I was trying to figure out! I can see how the code would work on the extracted data, but I am still missing a step in the CSV extraction process and I would appreciate it if you or anyone else reading this could point me in the right direction: As you said, the CSV did contain a bunch of stuff in one string, but I managed to hack everything apart, leaving me with a pandas df containing a sentence column and 10 data columns (because some sentences include up to 10 different entities). - Ventricle
A data cell therefore contains either something like 'want', '@command', 2, 6 or NaN. What I'm still unclear about is how I turn this pandas dataframe into that format you gave as an example. Or, more specifically, what this format is and which terms I should google to learn about how to turn the df into it. Am I right in assuming that it consists of tuples nested inside lists nested inside a dictionary? - Ventricle
Yes, the desired format is a list of tuples, each containing a string (the text) and a dictionary. The dictionary has one entry, 'entities', and its value is a list of tuples (triples) consisting of two integers (the start and end index) and a string (the label). I'm not that familiar with pandas DataFrames, but it definitely seems like a common use case, so I'm sure you'll be able to figure this out. Even if you can only extract long strings, as long as their format is consistent you can always write a hacky converter script in Python using split, strip etc. - Edan
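
For the DataFrame described in the comments, one possible route is to convert it to plain Python rows with `df.to_dict('records')` and build the tuples from there. The sketch below uses hand-written rows in that shape so it runs without pandas. The column names `data_0`, `data_1`, ... and the helper names are assumptions, each non-empty cell is assumed to be a `(text, label, start, end)` tuple, and the offsets are passed through unchanged on the assumption that they are already end-exclusive (so `sentence[start:end]` equals the entity text).

```python
import math

# Hypothetical rows, shaped like the result of df.to_dict('records').
# The column names 'data_0', 'data_1', ... are assumptions.
records = [
    {
        'sentence': 'I want apples',
        'data_0': ('want', '@command', 2, 6),
        'data_1': ('apples', '@fruit', 7, 13),
        'data_2': float('nan'),  # unused entity slot
    },
    {
        'sentence': 'Hello there',
        'data_0': float('nan'),
        'data_1': float('nan'),
        'data_2': float('nan'),
    },
]


def is_missing(cell):
    """True for the NaN placeholder pandas puts in empty cells."""
    return isinstance(cell, float) and math.isnan(cell)


def records_to_train_data(rows, text_column='sentence'):
    """Build a list of (text, {'entities': [(start, end, label), ...]}) tuples."""
    train_data = []
    for row in rows:
        entities = []
        for column, cell in row.items():
            if column == text_column or is_missing(cell):
                continue
            _entity_text, label, start, end = cell
            # Offsets are passed through unchanged (assumed end-exclusive).
            entities.append((int(start), int(end), label.lstrip('@').upper()))
        train_data.append((row[text_column], {'entities': entities}))
    return train_data


TRAIN_DATA = records_to_train_data(records)
print(TRAIN_DATA)
```

Sentences without any entities are kept with an empty 'entities' list. Whether you want to keep or drop them for training is a separate decision.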