Train Spacy NER on Indian Names

I am trying to customize spaCy's NER to identify Indian names. I am following this guide: https://spacy.io/usage/training and this is the dataset I am using: https://gist.githubusercontent.com/mbejda/9b93c7545c9dd93060bd/raw/b582593330765df3ccaae6f641f8cddc16f1e879/Indian-Female-Names.csv

As per the code, I am supposed to provide training data in the following format:

TRAIN_DATA = [
    # Entity offsets are (start, end, label); the end offset is exclusive
    ('Shivani', {
        'entities': [(0, 7, 'PERSON')]
    }),
    ('Isha', {
        'entities': [(0, 4, 'PERSON')]
    })
]

How do I provide training data to spaCy for ~12,000 names? Manually specifying each entity would be a chore. Is there any tool available to tag all the names?

Mandler answered 26/3, 2018 at 4:25 Comment(2)
Open the CSV file, use csv.reader to read each row, create a tuple with (name, {'entities': [(x, y, 'PERSON')]}) or whatever the values are, and append it to TRAIN_DATA. There's nothing particularly complicated here, but if you try it and get stuck somewhere, you can show us your code and where it's doing something wrong. - Nacre
@shri_wahal, what is the best solution you found for your problem? - Tortoiseshell
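
The steps in the first comment can be sketched roughly as follows. This is a minimal sketch, not a tested pipeline: the inline sample stands in for the downloaded CSV, and the header row and the position of the name column are assumptions about that file. The helper name load_name_entries is hypothetical. Note that it only produces name-only entries; the end offset is simply the length of the name because spaCy's end offsets are exclusive.

```python
import csv
import io

# Stand-in for the downloaded CSV. The header row and the column order
# are assumptions about the real file, not taken from it.
SAMPLE_CSV = "name,gender,race\nShivani,f,indian\nIsha,f,indian\n"


def load_name_entries(csv_file, label="PERSON"):
    """Build (text, {'entities': [...]}) tuples from the first CSV column."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row (assumed to be present)
    entries = []
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank rows
        name = row[0].strip()
        # The whole text is the entity: start at 0, exclusive end at len(name)
        entries.append((name, {"entities": [(0, len(name), label)]}))
    return entries


TRAIN_DATA = load_name_entries(io.StringIO(SAMPLE_CSV))
```

For the real file, the io.StringIO object would be replaced by an open file handle, for example open(path, newline='', encoding='utf-8').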
G
8

You are missing the point of training an NLP library on custom names. The training data has to be a list of entries, each containing a sentence with the character offsets of the name(s) in it marked. Please review the training data example again: you need to supply full sentences, not just names.

spaCy is not meant to be a gazetteer-matching tool. You are likely better off generating 100 sentences that use some of these names and then training spaCy on those annotated sentences. You can add more full-sentence examples as needed to increase accuracy. spaCy's native NER for names is robust and does not need 12,000 examples.

@ak_35's answer below provides examples of how to provide training sentences with the location of names labeled.
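
One way to generate such sentences without annotating by hand is to fill names into sentence templates and compute the offsets in code. The sketch below is only an illustration: the names, the templates, and the helper make_examples are made up for this example, and a handful of templates will not cover real-world variety. It assumes each template contains exactly one {name} placeholder.

```python
import random

# Illustrative inputs; in practice the names would come from the dataset
# and the templates would be far more varied.
NAMES = ["Shivani", "Isha", "Aditi"]
TEMPLATES = [
    "{name} lives in Chennai",
    "Did you talk to {name} yesterday",
    "{name} bought a new phone",
]


def make_examples(names, templates, n, seed=0):
    """Return n (text, annotations) tuples with computed entity offsets."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    examples = []
    for _ in range(n):
        name = rng.choice(names)
        template = rng.choice(templates)
        # The name starts where the placeholder started
        # (valid because each template has exactly one placeholder).
        start = template.index("{name}")
        end = start + len(name)  # end offset is exclusive
        text = template.replace("{name}", name)
        examples.append((text, {"entities": [(start, end, "PERSON")]}))
    return examples


TRAIN_DATA = make_examples(NAMES, TEMPLATES, n=100)
```

Because the offsets are derived from the template rather than typed by hand, text[start:end] always equals the inserted name, which avoids off-by-one mistakes in the annotations.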

Gallic answered 27/3, 2018 at 2:27 Comment(0)
J
7

Your current format for TRAIN_DATA will not give you good results. spaCy needs data in the format shown below:

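TRAIN_DATA = [
    # Offsets are character positions; the end offset is exclusive,
    # so text[start:end] must equal the entity text.
    ('Shivani lives in Chennai', {
        'entities': [(0, 7, 'PERSON')]
    }),
    ('Did you talk to Shivani yesterday', {
        'entities': [(16, 23, 'PERSON')]
    }),
    ('Isha bought a new phone', {
        'entities': [(0, 4, 'PERSON')]
    })
]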

See the documentation here. As for automating the annotation of 12,000 entries, there are tools that help you annotate data quickly. You can use Prodigy (from the same developers as spaCy), but it is a paid product. You can see it in action here. If you give up on NER, pattern matching might also work well if you just need to find names in a document; done right, it would be faster and more accurate too.
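
To illustrate the pattern-matching route: spaCy ships its own Matcher and PhraseMatcher classes for this, but the sketch below uses only the standard-library re module so that it runs without any model. It is a plain gazetteer lookup, not spaCy's matcher, and the helper names build_name_pattern and find_names are hypothetical.

```python
import re


def build_name_pattern(names):
    """Compile one case-insensitive regex that matches any known name."""
    # Longest names first, so a longer name wins over one that is its prefix.
    alternatives = sorted((re.escape(n) for n in names), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


def find_names(text, pattern):
    """Return (start, end, matched_text) for every name found in text."""
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


pattern = build_name_pattern(["Shivani", "Isha"])
matches = find_names(
    "Did you talk to Shivani yesterday? Isha was there too.", pattern
)
```

The returned offsets use the same (start, exclusive end) convention as spaCy's entity annotations, so matches found this way could also be used to pre-annotate sentences for training.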

Jehiel answered 1/5, 2018 at 16:35 Comment(0)
C
1

As noted by @ak_35, training data needs to be in a spaCy format.
One way to do so is to use spacy-annotator, which provides a simple UI for annotating the entities you are interested in (e.g. PERSON):

import pandas as pd
import re
from annotator.active_annotations import annotate

# Data
df = pd.DataFrame.from_dict({'full_text' : ['Shivani lives in chennai']})

# Annotations
dd = annotate(df,
            col_text = 'full_text',
            labels = ['PERSON'],
            sample_size=1,
            model = 'en',
            regex_flags=re.IGNORECASE
            )

After annotating the relevant names, you can see the output by doing:

# Output
dd['annotations'][0]
Concepcion answered 25/9, 2020 at 11:45 Comment(0)
C
-3

If you're trying to figure out the index of the names, then it's quite simple:

(0, len(name.split(sep=',')[0]))
Carbonic answered 26/3, 2018 at 7:30 Comment(1)
That wasn't the question. - Mandler
