Which NER model can find person names inside a resume/CV?

4

10

I have just started with Stanford CoreNLP, and I would like to build a custom NER model to find person names.

Unfortunately, I did not find a good NER model for Italian. I need to find these entities inside a resume/CV document.

The problem here is that documents like these can have different structures. For example, I can have:

CASE 1

- Name: John

- Surname: Travolta

- Last name: Travolta

- Full name: John Travolta

(so there are many labels that can represent the person entity I need to extract)

CASE 2

My name is John Travolta and I was born ...

Basically, I can have structured data (with different labels) or running text in which I should find these entities.

What is the best approach for this kind of document? Can a MaxEnt model work in this case?


EDIT @vihari-piratla

At the moment, I adopt the strategy of finding a pattern that has something on the left and something on the right; following this method I find the entity 80-85% of the time.

Example:

Name: John
Birthdate: 2000-01-01

It means that I have "Name:" on the left of the pattern and a \n on the right (capture until the \n). I can create a very long list of patterns like these. I thought about patterns because I do not need names in "other" contexts.

For example, if the user writes other names inside a job experience, I do not need them, because I am looking for the personal name, not others. With this method I can reduce false positives, because I look at specific patterns, not "general names".

A problem with this method is that I have a big list of patterns (1 pattern = 1 regex), so it does not scale well if I add more.

If I could train an NER model with all those patterns it would be awesome, but I would need tons of documents to train it well.
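For illustration, the label-on-the-left approach can be collapsed into one parameterized pattern per label instead of one hand-written regex each. This is only a minimal sketch; the label list and the `extract_name` helper are made up for this example:

```python
import re

# Hypothetical list of label variants that may precede the personal name.
NAME_LABELS = ["Name", "Full name", "Nome", "Nome completo"]

# One compiled pattern per label: the label on the left,
# capture everything up to the newline (or a full stop) on the right.
NAME_PATTERNS = [
    re.compile(r"(?im)^\s*" + re.escape(label) + r"\s*:\s*(?P<value>[^\n.]+)")
    for label in NAME_LABELS
]

def extract_name(text):
    """Return the first labeled name found, or None."""
    for pattern in NAME_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group("value").strip()
    return None

cv = "Full name: John Travolta\nBirthdate: 2000-01-01"
print(extract_name(cv))  # John Travolta
```

Because the label list is just data, adding a new label variant no longer means writing a new regex by hand.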

Feingold answered 28/12, 2015 at 23:54 Comment(15)
What kind of performance are you looking for (precision, recall, accuracy...)? Have you tried using an off-the-shelf English tool, and if so, how was the performance? You may be pleasantly surprised for this task. Do you have a labelled Italian NER person corpus?Peninsula
No, I do not have a labelled Italian NER person corpus. I have around 20k documents that can be processed (we can label persons), but it is a boring process and it will take a lot of time. Should I follow this solution?Feingold
Generally, supervised approaches work the best, but you pay the cost of creating the corpus. It depends on your budget. :) I would try processing with an English NER tool and see what kind of results you get. Obviously the last name Travolta should be recognized as a person's name in both English and Italian - I think Travolta might even be an Italian last name. :) If that isn't good enough, I would start with whatever Italian POS tagger is available to find noun phrases and use some basic syntactic features (obviously capitalization to start with) and go from there.Peninsula
@ozborn, the problem is that in Italian there are names with articles/prepositions inside, sometimes with verbs too... so it will be very difficult if the user does not write his name with capital letters. I also thought about a POS tagger, but this problem seems too difficult.Feingold
Have you tried cis.uni-muenchen.de/~schmid/tools/TreeTagger, which does POS tagging? Apparently it has worked for Italian. I don't think the article/preposition inside makes a huge difference; I think most English POS taggers would handle them, as such names are common enough in the US.Peninsula
@Peninsula OK, so if I have understood you correctly, the steps should be: 1. tokenize the text 2. POS tagging 3. regex to find specific patterns with a series of nouns/articles... correct?Feingold
Mostly, I updated my answer below to give you more detailsPeninsula
Also have a look at the EVALITA 2011 NER task and see what resources/approaches were effective; have a look at evalita.it/2011/tasks/NER and evalita.it/sites/evalita.fbk.eu/files/working_notes2011/NER/…Quillen
@Feingold Sorry for the delay in getting back. Your edit only made me more confused. Let me restate it, and correct me if I am wrong. You have several CV documents and you wish to identify personal names in a few particular sections of the document, not every personal name in the entire document. I do not understand why it requires many patterns; for example, to match "Name: John" the pattern is "Name: (.*)". As in the example provided, the pattern need not have any dependence on the entity, so it should be scalable, no?Wexford
@VihariPiratla yes, but I could also have name: {name} surname: {surname} or full name: {name}, etc. Sometimes it ends with \n, sometimes with a full stop, etc.Feingold
You just mentioned six different patterns. I believe that the total number of such patterns is still manageable/scalable.Wexford
@VihariPiratla yes, but remember that I also need other regexes to find other entities like birthdate, address etc., so Birthdate: {entity} address: {entity}, and then I also have expressions for normal language too, like my name is {entity} and my surname is {entity}. So per entity there are not a lot, but if we sum all the regexes for all the entities, yes, there are many.Feingold
Take a look at nlp.stanford.edu/pubs/Gupta_Manning_CoNLL14_slides.pdf. Do you think something like this can help you?Wexford
You are my new hero! Yes! That's what I should use!Feingold
Could you write this into your reply?Feingold
7

The first case could be trivial, and I agree with Ozborn's suggestion.

I would like to make a few suggestions for case 2.
Stanford NLP provides an excellent English name recognizer, but it may not be able to find all person names. OpenNLP also gives decent performance, though much less than Stanford's. There are many other entity recognizers available for English. I will focus here on Stanford NLP; here are a few things to consider.

  1. Gazettes. You can provide the model with a list of names and also customize how the gazette entries are matched. Stanford also provides a sloppy match option which, when set, allows partial matches with the gazette entries. Partial matches should work well with person names.

  2. Stanford recognizes entities constructively. If a name like "John Travolta" is recognized in a document, then it would also catch "Travolta" in the same document even if it had no prior idea about "Travolta". So, append as much information to the document as possible: if "John Travolta" is recognized by the rules employed in case 1, add the names recognized in case 1 in a familiar context such as "My name is John Travolta.". Adding such dummy sentences can improve recall.
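The partial ("sloppy") matching idea from point 1 can be illustrated in plain Python. This is only a sketch of the concept, not Stanford's actual implementation; the `sloppy_gazette_match` helper and the sample gazette are hypothetical:

```python
# A tiny gazette of full person names (assumed sample data).
GAZETTE = {"John Travolta", "Maria Rossi"}

# Index every token of every gazette entry, so that a partial mention
# like "Travolta" alone still matches the full entry.
TOKEN_INDEX = {}
for entry in GAZETTE:
    for token in entry.split():
        TOKEN_INDEX.setdefault(token, set()).add(entry)

def sloppy_gazette_match(token):
    """Return the gazette entries that share a token with `token`."""
    return TOKEN_INDEX.get(token, set())

print(sloppy_gazette_match("Travolta"))  # {'John Travolta'}
```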

Creating a benchmark for training is a very costly and boring process; you would have to annotate on the order of tens of thousands of sentences for decent test performance. I am sure that even if you had a model trained on annotated data, the performance would not be any better than with the two steps above implemented.

@edit

Since the asker of this question is interested in unsupervised pattern-based approaches, I am expanding my answer to discuss them.

When supervised data is not available, a method called bootstrapped pattern learning is generally used. The algorithm starts with a small set of seed instances of the type of interest (like a list of book titles) and outputs more instances of the same type.
Refer to the following resources for more information:

  • SPIED is software that uses the technique described above and is available for download and use.
  • Sonal Gupta received a Ph.D. on this topic; her dissertation is available here.
  • For a light introduction to this topic, see these slides.
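A toy version of such a bootstrapping loop might look like the following. This is a deliberately simplified sketch; the corpus, seed name, and helper functions are invented for illustration, and real systems like SPIED additionally score and filter the learned patterns:

```python
import re

# Tiny made-up corpus and a single seed person name.
corpus = [
    "My name is John Travolta and I was born in 1954.",
    "My name is Maria Rossi and I live in Rome.",
    "Contact details are listed below.",
]
seeds = {"John Travolta"}

def learn_patterns(corpus, instances):
    """Turn each occurrence of a known instance into a left-context pattern."""
    patterns = set()
    for sentence in corpus:
        for name in instances:
            if name in sentence:
                # Keep the immediate left context as a (crude) pattern.
                patterns.add(sentence.split(name)[0])
    return patterns

def apply_patterns(corpus, patterns):
    """Extract new candidate instances using the learned contexts."""
    found = set()
    for sentence in corpus:
        for left in patterns:
            if left and sentence.startswith(left):
                rest = sentence[len(left):]
                match = re.match(r"([A-Z][a-z]+ [A-Z][a-z]+)", rest)
                if match:
                    found.add(match.group(1))
    return found

patterns = learn_patterns(corpus, seeds)
print(sorted(apply_patterns(corpus, patterns)))  # ['John Travolta', 'Maria Rossi']
```

The pattern "My name is " learned from the seed pulls out the previously unseen "Maria Rossi"; iterating the two steps grows the instance set.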

Thanks

Wexford answered 4/1, 2016 at 9:34 Comment(8)
Yes, I understand that. Since there are no readily available models, I have given some suggestions to make good use of the English model.Wexford
Pardon, do you mean I should use an English model on an Italian document?Feingold
Yes, but I assumed that the context is in English. You may have to change the question in order to make this clear if the context is not really in English.Wexford
I wrote that I am looking for an Italian model or a way to train itFeingold
I gave you +500; however, I would like to understand it better. I can use the gazettes (dictionaries), but I should reduce false positives, so for case 1 should I adopt it without regex rules?Feingold
Using regex or some other template is better for Case 1 and use the names recognised in Case 1 to do a better job in Case 2 as I have explained in my post. You have to reduce false positives in which case, do you already have something implemented? Can you also provide more information about the data you are dealing with, provide a sample document from data if possible. Please edit your question and I will edit the answer accordingly. Thanks for the bounty :)Wexford
I can't edit my above comment now. I missed a Question mark '?' after "You have to reduce false positive in which case?" in my comment above.Wexford
Updated! I await your reply, thank you!Feingold
7

The traditional (and probably best) approach for case 1 is to write document segmentation code, whereas case 2 is what most NER systems are designed for. You can search Google Scholar for "document segmentation" to get some ideas about the "best" approach. The most commonly implemented (and easiest) option is to simply use regular expressions, which can be highly effective if the document structure is consistent. Other approaches are more complex but are usually needed when there is more diversity in document structure.

Your NER Pipeline at a minimum will need:

  1. Pre-processing / text tokenization. Start with just a few simple tokenization rules.
  2. Document segmentation (colons, dashes, spotting headers, any forms, etc.). I would start with regular expressions for this.
  3. POS tagging (preferably using something off the shelf like TreeTagger that has worked with Italian).
  4. NER; a MaxEnt model will work. Some important features for this would be capitalization, POS tags and probably dictionary features (an Italian phonebook?). You will need some labelled data.
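Step 2 of the pipeline above could start as simply as this. It is a minimal sketch; the regex and the `segment` helper are assumptions for illustration, not an off-the-shelf tool:

```python
import re

# A line like "Label: value" is treated as a structured field (case 1);
# everything else is free text to be handed on to POS tagging and NER (case 2).
FIELD_RE = re.compile(r"^\s*([\w /]+?)\s*[:\-]\s*(.+)$")

def segment(document):
    """Split a CV into (fields, free_text) for the two processing paths."""
    fields, free_text = {}, []
    for line in document.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1).lower()] = match.group(2).strip()
        elif line.strip():
            free_text.append(line.strip())
    return fields, free_text

doc = "Name: John\nSurname: Travolta\nMy name is John Travolta and I was born..."
fields, free = segment(doc)
print(fields)  # {'name': 'John', 'surname': 'Travolta'}
```

The structured fields can then be matched against known labels, while the free-text lines go through the tokenization / POS / NER steps.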
Peninsula answered 29/12, 2015 at 5:23 Comment(2)
Thank you for your opinion. Basically, are you talking about a sentence detector? I can train a model to detect the end of a sentence, but what can I do after that? Should I apply a regex to each sentence?Feingold
No, not a sentence detector. That is often a separate part of the pipeline. I am talking about finding document segments, so things like colons, dashes and line-breaks are what you are looking for - not end of sentence signifiers which are periods, exclamation marks, question marks...Peninsula
4

You can use Stanford NLP. For example, here is some Python code that uses the NLTK and Stanford NLP libraries:

import re

from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

doc_text = "your input string goes here"

words = re.split(r"\W+", doc_text)

stops = set(stopwords.words("english"))

# remove stop words and very short tokens from the list
words = [w for w in words if w not in stops and len(w) > 2]

text = " ".join(words)  # avoid shadowing the built-in str
print(text)

stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')

# keep only the tokens the POS tagger marks as proper nouns (NNP)
stanford_pos_tag_list = [word for word, pos in stp.tag(text.split()) if pos == 'NNP']

print("Stanford POS tagged:")
print(stanford_pos_tag_list)

tagged = stn.tag(stanford_pos_tag_list)
print(tagged)

This should give you all proper nouns in the input string.

Inequality answered 31/12, 2015 at 11:49 Comment(2)
Yes but this does not solve the problem of detecting entities near specific labels.Feingold
Correct, this is more to solve your case 2. Let me think a bit more about case 1Inequality
0

If it is a resume/CV-type document you are talking about, then the best bet is to build a corpus, or to start with a reduced "accuracy" expectation and build the corpus dynamically by teaching the system as users use it - be it OpenNLP, StanfordNLP or any other. Within the limitations of my learning, NERs are not really mature enough for resume/CV-type documents, even for English.

Winna answered 7/1, 2016 at 9:26 Comment(2)
What do you mean by "reduced accuracy expectation"?Feingold
Sure. When we build the corpus, we may not get results with the level of accuracy we desire or expect. Again, this is because the corpus is still being built and the system is still being taught, if you will. Therefore, the accuracy may not be as expected. That's what I meant by reduced accuracy expectation. Hopefully I answered rightly this time; if not, please feel free to ask again and I can try to explain more of what I meant. That helps me too, by the way.Winna

© 2022 - 2024 — McMap. All rights reserved.