Which NER model can find person names inside a resume/CV?

4

10

I have just started with Stanford CoreNLP, and I would like to build a custom NER model to find person names.

Unfortunately, I did not find a good NER model for Italian. I need to find these entities inside a resume/CV document.

The problem here is that documents like these can have different structures. For example, I can have:

CASE 1

- Name: John

- Surname: Travolta

- Last name: Travolta

- Full name: John Travolta

(so there are many labels that can represent the person entity I need to extract)

CASE 2

My name is John Travolta and I was born ...

Basically, I can have structured data (with different labels) or running text in which I should find these entities.

What is the best approach for this kind of document? Can a MaxEnt model work in this case?


EDIT @vihari-piratla

At the moment, I adopt the strategy of finding a pattern that has something on the left and something on the right; following this method I find the entity 80-85% of the time.

Example:

Name: John
Birthdate: 2000-01-01

It means that I have "Name:" on the left of the pattern and a \n on the right (capture until the \n). I can create a very long list of patterns like these. I thought about patterns because I do not need names in "other" contexts.

For example, if the user writes other names inside a job experience, I do not need them, because I am looking for the personal name, not others. With this method I can reduce false positives, because I look at specific patterns, not "general names".

A problem with this method is that I have a big list of patterns (1 pattern = 1 regex), so it does not scale well if I add more.

If I could train an NER model with all those patterns it would be awesome, but I would need tons of documents to train it well.
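For illustration, the label-on-the-left approach can be collapsed into one parameterized pattern per label instead of one hand-written regex each. This is only a minimal sketch; the label list and the `extract_name` helper are made up for this example:

```python
import re

# Hypothetical list of label variants that may precede the personal name.
NAME_LABELS = ["Name", "Full name", "Nome", "Nome completo"]

# One compiled pattern per label: the label on the left,
# capture everything up to the newline (or a full stop) on the right.
NAME_PATTERNS = [
    re.compile(r"(?im)^\s*" + re.escape(label) + r"\s*:\s*(?P<value>[^\n.]+)")
    for label in NAME_LABELS
]

def extract_name(text):
    """Return the first labeled name found, or None."""
    for pattern in NAME_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group("value").strip()
    return None

cv = "Full name: John Travolta\nBirthdate: 2000-01-01"
print(extract_name(cv))  # John Travolta
```

Because the label list is just data, adding a new label variant no longer means writing a new regex by hand.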

Feingold answered 28/12, 2015 at 23:54 Comment(15)
What kind of performance are you looking for (precision, recall, accuracy...)? Have you tried using an off-the-shelf English tool, and if so, how was the performance? You may be pleasantly surprised for this task. Do you have a labelled Italian NER person corpus?Peninsula
No, I do not have a labelled Italian NER person corpus. I have around 20k documents that can be processed (we can label persons), but it is a boring process and it will take a lot of time. Should I follow this solution?Feingold
Generally, supervised approaches work the best, but you pay the cost of creating the corpus. It depends on your budget. :) I would try processing with an English NER tool and see what kind of results you get. Obviously the last name Travolta should be recognized as a person's name in both English and Italian - I think Travolta might even be an Italian last name. :) If that isn't good enough, I would start with whatever Italian POS tagger is available to find noun phrases and use some basic syntactic features (obviously capitalization to start with) and go from there.Peninsula
@ozborn, the problem is that in Italian there are names with articles/prepositions inside, sometimes with verbs too... so it will be very difficult if the user does not write his name with capital letters. I also thought about a POS tagger, but this problem seems too difficult.Feingold
Have you tried cis.uni-muenchen.de/~schmid/tools/TreeTagger, which does POS tagging? Apparently it has worked for Italian. I don't think the article/preposition inside makes a huge difference; I think most English POS taggers would handle them, as such names are common enough in the US.Peninsula
@Peninsula OK, so if I have understood you correctly, the steps should be: 1. tokenize the text 2. POS tagging 3. regex to find specific patterns with a series of nouns/articles... correct?Feingold
Mostly, I updated my answer below to give you more detailsPeninsula
Also have a look at the EVALITA 2011 NER task and see what resources/approaches were effective; have a look at evalita.it/2011/tasks/NER and evalita.it/sites/evalita.fbk.eu/files/working_notes2011/NER/…Quillen
@Feingold Sorry for the delay in getting back. Your edit only made me more confused. Let me restate it, and correct me if I am wrong. You have several CV documents and you wish to identify personal names in a few particular sections of the document, not every personal name in the entire document. I do not understand why it requires many patterns; for example, to match "Name: John" the pattern is "Name: (.*)". As in the example provided, the pattern need not have any dependence on the entity, so it should be scalable, no?Wexford
@VihariPiratla yes, but I could also have name: {name} surname: {surname} or full name: {name}, etc. Sometimes it ends with \n, sometimes with a full stop, etc.Feingold
You just mentioned six different patterns. I believe that the total number of such patterns is still manageable/scalable.Wexford
@VihariPiratla yes, but remember that I also need other regexes to find other entities like birthdate, address etc., so Birthdate: {entity} address: {entity}, and then I also have expressions for normal language too, like my name is {entity} and my surname is {entity}. So per entity there are not a lot, but if we sum all the regexes for all the entities, yes, there are many.Feingold
Take a look at nlp.stanford.edu/pubs/Gupta_Manning_CoNLL14_slides.pdf. Do you think something like this can help you?Wexford
You are my new hero! Yes! That's what I should use!Feingold
Could you write this into your reply?Feingold
7

The first case could be trivial, and I agree with Ozborn's suggestion.

I would like to make a few suggestions for case 2.
Stanford NLP provides an excellent English name recognizer, but it may not be able to find all person names. OpenNLP also gives decent performance, though much less than Stanford's. There are many other entity recognizers available for English. I will focus here on Stanford NLP; here are a few things to consider.

  1. Gazettes. You can provide the model with a list of names and also customize how the gazette entries are matched. Stanford also provides a sloppy match option which, when set, allows partial matches with the gazette entries. Partial matches should work well with person names.

  2. Stanford recognizes entities constructively. If a name like "John Travolta" is recognized in a document, then it would also catch "Travolta" in the same document even if it had no prior idea about "Travolta". So, append as much information to the document as possible: if "John Travolta" is recognized by the rules employed in case 1, add the names recognized in case 1 in a familiar context such as "My name is John Travolta.". Adding such dummy sentences can improve recall.
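The partial ("sloppy") matching idea from point 1 can be illustrated in plain Python. This is only a sketch of the concept, not Stanford's actual implementation; the `sloppy_gazette_match` helper and the sample gazette are hypothetical:

```python
# A tiny gazette of full person names (assumed sample data).
GAZETTE = {"John Travolta", "Maria Rossi"}

# Index every token of every gazette entry, so that a partial mention
# like "Travolta" alone still matches the full entry.
TOKEN_INDEX = {}
for entry in GAZETTE:
    for token in entry.split():
        TOKEN_INDEX.setdefault(token, set()).add(entry)

def sloppy_gazette_match(token):
    """Return the gazette entries that share a token with `token`."""
    return TOKEN_INDEX.get(token, set())

print(sloppy_gazette_match("Travolta"))  # {'John Travolta'}
```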

Creating a benchmark for training is a very costly and boring process; you would have to annotate on the order of tens of thousands of sentences for decent test performance. I am sure that even if you had a model trained on annotated data, the performance would not be any better than with the two steps above implemented.

@edit

Since the asker of this question is interested in unsupervised pattern-based approaches, I am expanding my answer to discuss them.

When supervised data is not available, a method called bootstrapped pattern learning is generally used. The algorithm starts with a small set of seed instances of the type of interest (like a list of book titles) and outputs more instances of the same type.
Refer to the following resources for more information:

  • SPIED is software that uses the technique described above and is available for download and use.
  • Sonal Gupta received a Ph.D. on this topic; her dissertation is available here.
  • For a light introduction to this topic, see these slides.
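A toy version of such a bootstrapping loop might look like the following. This is a deliberately simplified sketch; the corpus, seed name, and helper functions are invented for illustration, and real systems like SPIED additionally score and filter the learned patterns:

```python
import re

# Tiny made-up corpus and a single seed person name.
corpus = [
    "My name is John Travolta and I was born in 1954.",
    "My name is Maria Rossi and I live in Rome.",
    "Contact details are listed below.",
]
seeds = {"John Travolta"}

def learn_patterns(corpus, instances):
    """Turn each occurrence of a known instance into a left-context pattern."""
    patterns = set()
    for sentence in corpus:
        for name in instances:
            if name in sentence:
                # Keep the immediate left context as a (crude) pattern.
                patterns.add(sentence.split(name)[0])
    return patterns

def apply_patterns(corpus, patterns):
    """Extract new candidate instances using the learned contexts."""
    found = set()
    for sentence in corpus:
        for left in patterns:
            if left and sentence.startswith(left):
                rest = sentence[len(left):]
                match = re.match(r"([A-Z][a-z]+ [A-Z][a-z]+)", rest)
                if match:
                    found.add(match.group(1))
    return found

patterns = learn_patterns(corpus, seeds)
print(sorted(apply_patterns(corpus, patterns)))  # ['John Travolta', 'Maria Rossi']
```

The pattern "My name is " learned from the seed pulls out the previously unseen "Maria Rossi"; iterating the two steps grows the instance set.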

Thanks

Wexford answered 4/1, 2016 at 9:34 Comment(8)
Yes, I understand that. Since there are no readily available models, I have given some suggestions to make good use of the English model.Wexford
Pardon, do you mean I should use an English model on an Italian document?Feingold
Yes, but I assumed that the context is in English. You may have to change the question in order to make this clear if the context is not really in English.Wexford
I wrote that I am looking for an Italian model or a way to train itFeingold
I gave you +500; however, I would like to understand it better. I can use the gazettes (dictionaries), but I should reduce false positives, so for case 1 should I adopt it without regex rules?Feingold
Using regex or some other template is better for Case 1 and use the names recognised in Case 1 to do a better job in Case 2 as I have explained in my post. You have to reduce false positives in which case, do you already have something implemented? Can you also provide more information about the data you are dealing with, provide a sample document from data if possible. Please edit your question and I will edit the answer accordingly. Thanks for the bounty :)Wexford
I can't edit my above comment now. I missed a Question mark '?' after "You have to reduce false positive in which case?" in my comment above.Wexford
Updated! I await your reply, thank you!Feingold
7

The traditional (and probably best) approach for case 1 is to write document segmentation code, whereas case 2 is what most NER systems are designed for. You can search Google Scholar for "document segmentation" to get some ideas about the "best" approach. The most commonly implemented (and easiest) option is to simply use regular expressions, which can be highly effective if the document structure is consistent. Other approaches are more complex but are usually needed when there is more diversity in document structure.

Your NER Pipeline at a minimum will need:

  1. Pre-processing / text tokenization. Start with just a few simple tokenization rules.
  2. Document segmentation (colons, dashes, spotting headers, any forms, etc.). I would start with regular expressions for this.
  3. POS tagging (preferably using something off the shelf like TreeTagger that has worked with Italian).
  4. NER; a MaxEnt model will work. Some important features for this would be capitalization, POS tags and probably dictionary features (an Italian phonebook?). You will need some labelled data.
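Step 2 of the pipeline above could start as simply as this. It is a minimal sketch; the regex and the `segment` helper are assumptions for illustration, not an off-the-shelf tool:

```python
import re

# A line like "Label: value" is treated as a structured field (case 1);
# everything else is free text to be handed on to POS tagging and NER (case 2).
FIELD_RE = re.compile(r"^\s*([\w /]+?)\s*[:\-]\s*(.+)$")

def segment(document):
    """Split a CV into (fields, free_text) for the two processing paths."""
    fields, free_text = {}, []
    for line in document.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1).lower()] = match.group(2).strip()
        elif line.strip():
            free_text.append(line.strip())
    return fields, free_text

doc = "Name: John\nSurname: Travolta\nMy name is John Travolta and I was born..."
fields, free = segment(doc)
print(fields)  # {'name': 'John', 'surname': 'Travolta'}
```

The structured fields can then be matched against known labels, while the free-text lines go through the tokenization / POS / NER steps.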
Peninsula answered 29/12, 2015 at 5:23 Comment(2)
Thank you for your opinion. Basically, are you talking about a sentence detector? I can train a model to detect the end of a sentence, but what can I do after that? Should I apply a regex to each sentence?Feingold
No, not a sentence detector. That is often a separate part of the pipeline. I am talking about finding document segments, so things like colons, dashes and line-breaks are what you are looking for - not end of sentence signifiers which are periods, exclamation marks, question marks...Peninsula
4

You can use Stanford NLP. For example, here is some Python code that uses the NLTK and Stanford NLP libraries:

import re

from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

doc_text = "your input string goes here"

words = re.split(r"\W+", doc_text)

stops = set(stopwords.words("english"))

# remove stop words and very short tokens from the list
words = [w for w in words if w not in stops and len(w) > 2]

text = " ".join(words)  # avoid shadowing the built-in str
print(text)

stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')

# keep only the tokens the POS tagger marks as proper nouns (NNP)
stanford_pos_tag_list = [word for word, pos in stp.tag(text.split()) if pos == 'NNP']

print("Stanford POS tagged:")
print(stanford_pos_tag_list)

tagged = stn.tag(stanford_pos_tag_list)
print(tagged)

This should give you all proper nouns in the input string.

Inequality answered 31/12, 2015 at 11:49 Comment(2)
Yes but this does not solve the problem of detecting entities near specific labels.Feingold
Correct, this is more to solve your case 2. Let me think a bit more about case 1Inequality
0

If it is a resume/CV-type document you are talking about, then the best bet is to build a corpus, or to start with a reduced "accuracy" expectation and build the corpus dynamically by teaching the system as users use it - be it OpenNLP, StanfordNLP or any other. Within the limitations of my learning, NERs are not really mature enough for resume/CV-type documents, even for English.

Winna answered 7/1, 2016 at 9:26 Comment(2)
What do you mean by "reduced accuracy expectation"?Feingold
Sure. When we build the corpus, we may not get results with the level of accuracy we desire or expect. Again, this is because the corpus is still being built and the system is still being taught, if you will. Therefore, the accuracy may not be as expected. That's what I meant by reduced accuracy expectation. Hopefully I answered rightly this time; if not, please feel free to ask again and I can try to explain more of what I meant. That helps me too, by the way.Winna

© 2022 - 2024 — McMap. All rights reserved.