Best method to confirm an entity
I would like to understand the best approach to the following problem.

I have documents very similar to resumes/CVs, and I have to extract entities from them (name, surname, birthday, city, zip code, etc.).

To extract those entities I am combining different finders (regex, dictionary, etc.).

There are no problems with those finders as such, but I am looking for a method or algorithm to confirm the entities.

By "confirm" I mean that I have to find specific terms (or entities) in proximity to (close to) the entities I have found.

Example:

My name is <name>
Name: <name>
Name and Surname: <name>

I can confirm the entity <name> because it is close to a specific term that lets me understand the "context". If I have the words "name" or "surname" near the entity, then I can say that I have found the <name> with good probability.

So the goal is to write this kind of rule to confirm entities. Another example would be:

My address is ......, 00143 Rome

Italian zip codes are 5 digits long (numeric only), so it is easy to find a 5-digit number inside my document (I use a regex, as I wrote above), and I also check it by querying a database to see whether the number exists. The problem here is that I need one more check to confirm it definitively.

I must check whether that number is near the entity <city>; if it is, then I have a good probability.
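A proximity rule like this can be sketched with a character window around each regex hit. A minimal sketch (the tiny city set and the 30-character window are arbitrary placeholders; a real system would load the full gazetteer from a database):

```python
import re

# Hypothetical mini-dictionary of cities; stands in for the real gazetteer.
CITIES = {"rome", "milan", "naples"}

def confirm_zipcode(text, window=30):
    """Return 5-digit numbers that have a known city within `window`
    characters on either side (proximity-based confirmation)."""
    confirmed = []
    for m in re.finditer(r"\b\d{5}\b", text):
        start = max(0, m.start() - window)
        context = text[start:m.end() + window].lower()
        if any(city in context for city in CITIES):
            confirmed.append(m.group())
    return confirmed

print(confirm_zipcode("My address is Via Appia 12, 00143 Rome"))  # ['00143']
print(confirm_zipcode("Order number 12345 shipped"))              # []
```

The same pattern (regex hit plus a context-window check for trigger words) works for names near "Name:"/"Surname:" labels.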

I also tried to train a model, but I do not really have a "context" (sentences). Training the model with:

My name is: <name>John</name>
Name: <name>John</name>
Name/Surname: <name>John</name>
<name>John</name> is my name

does not sound good to me because:

  1. I have read that we need many sentences to train a good model.
  2. Those are not "sentences"; I do not have a "context" (remember, as I said, the documents are similar to resumes/CVs).
  3. Maybe those phrases are too short.

I do not know how many different ways there are to say the same thing, but surely I cannot find 15,000 of them :)

What method should I use to try to confirm my entities?

Thank you so much!

Alibi answered 4/9, 2015 at 14:16 Comment(0)

Problem statement

First of all, I don't think your decomposition of the task into two steps (extract and confirm) is the best approach, unless I am missing some specifics of the problem. If I understand correctly, your goal is to extract structured info such as Name/City/etc. from a set of docs with maximum precision and recall; either metric can be more important, but usually they are weighted equally, e.g. via the F1 measure.

Evaluate first

'You can't control what you can't measure' (Tom DeMarco)

I'd propose first preparing an evaluation system and a marked-up dataset: for each document, find the correct Name/City/etc. This can be done fully manually (which is more accurate, but harder) or semi-automatically, e.g. by applying some method (including the one under development) and correcting its errors, if any. The evaluation system should be able to compute precision and recall (see the confusion matrix; they are easy to implement yourself).
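Computed over sets of extracted vs. gold-standard entities, those metrics reduce to a few set operations. A minimal sketch:

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall and F1 given the set of entities the
    extractor produced and the set from the marked-up (gold) dataset."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)  # true positives: correct extractions
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: 2 of 3 extractions are correct, 2 of 3 gold entities found.
p, r, f = precision_recall_f1({"John", "00143", "Rome"},
                              {"John", "Rome", "Smith"})
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```

Keeping separate counts per entity type (Name, City, ...) is just a matter of calling this once per type.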

As for its size, I wouldn't be too afraid of the need to prepare a big dataset: sure, more is better, but that is crucial mainly for complex (significantly non-linear) tasks with many features. I believe 100-200 docs are enough to start with in your case, and preparing them would take only several hours.

Then you can evaluate your simple extractors based on regexps and dictionaries; it's best if different aspects (Name vs. City) have separate metrics. Depending on the results, your next actions may differ.

Low precision - add more specific features

If the method shows too low precision, i.e. extracts too many wrong items, you should add specificity, i.e. more specific features. I'd search for them in scientific papers on information extraction, particularly those aimed at your specific information type, be it Name/Surname, Address, or something vaguer like skills, if you're interested in such info. For instance, many papers (like [2] and [3]) devoted to resume parsing note that Name/Surname are usually placed at the very beginning of the text, or that cities are usually preceded by 'at'. I don't know the specifics of your documents, but I doubt they violate such patterns.
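Positional and trigger-word cues like those can be expressed as a feature vector per candidate span. A minimal sketch (the trigger list and the 30-character window are invented defaults to tune on real data):

```python
def context_features(text, start, end,
                     triggers=("name", "surname", "address")):
    """Features for a candidate entity span [start:end): relative position
    in the document, plus presence of trigger words in a nearby window."""
    window = text[max(0, start - 30):end + 30].lower()
    return {
        # Names/surnames tend to appear near the top of a resume.
        "rel_position": start / max(len(text), 1),
        **{f"near_{t}": t in window for t in triggers},
    }

feats = context_features("Name: John Doe\n...", 6, 14)
print(feats["near_name"], feats["near_surname"])  # True False
```

Each rule you would have written by hand ("is there 'name' nearby?") becomes one boolean feature instead of a hard filter.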

It may also be useful and easy to treat the output of a Named Entity Recognizer, e.g. Stanford NLP, as a feature (see also this relevant question).

Again, the harder but better route is to analyze the approaches used in NERC research and adapt them to the specifics of your task and docs.

These features can be aggregated by any supervised machine learning method (start with Logistic Regression and Random Forest if you don't have much experience): you know the positive and negative (everything not positive) answers from your evaluation dataset; just transform them into feature space and feed them to some ML library like Weka.
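A sketch of that supervised step, using scikit-learn's LogisticRegression as a stand-in for Weka; the toy feature vectors (relative position, near-trigger-word flag, in-dictionary flag) and labels are invented for illustration:

```python
from sklearn.linear_model import LogisticRegression

# Each row describes one candidate: [rel_position, near_trigger, in_dict].
# Label 1 = confirmed entity in the marked-up dataset, 0 = false hit.
X = [[0.05, 1, 1], [0.10, 1, 1], [0.02, 1, 0],
     [0.80, 0, 0], [0.90, 0, 1], [0.70, 0, 0]]
y = [1, 1, 1, 0, 0, 0]

clf = LogisticRegression().fit(X, y)

# A candidate early in the doc with a trigger word vs. a late bare number.
print(clf.predict([[0.03, 1, 1], [0.95, 0, 0]]))  # [1 0]
```

The classifier's predicted probability then plays the role of your "confirmation": instead of a hand-tuned yes/no rule, you get a calibrated score per candidate.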

Low recall - extract more candidates

If the method shows too low recall, i.e. misses a lot of items, then you should extend the set of candidates: for example, develop less restrictive patterns, or add fuzzy matching (look at the Jaro-Winkler or Soundex string metrics) to the dictionary lookup.
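Fuzzy dictionary lookup can be prototyped with the standard library's difflib (its similarity ratio is a stand-in for Jaro-Winkler; the 0.8 cutoff and the tiny city list are assumptions to tune on real data):

```python
import difflib

CITIES = ["Rome", "Milan", "Naples", "Turin", "Florence"]

def fuzzy_city_lookup(token, cutoff=0.8):
    """Tolerate typos such as 'Romee' when matching against the city
    dictionary; returns the best match above the cutoff, else None."""
    matches = difflib.get_close_matches(token, CITIES, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_city_lookup("Romee"))   # Rome
print(fuzzy_city_lookup("Xyzzy"))   # None
```

For a 500,000-entry dictionary you would bucket entries (e.g. by first letter or a phonetic key like Soundex) so each lookup only scores a small slice, rather than scanning everything.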

Another option is to apply part-of-speech tagging and take each noun as a candidate (maybe each proper noun for some info items), or take noun bigrams, or add other weak restrictions. In this case your precision will most probably degrade, so the section above would have to be revisited.
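A cheap, recall-oriented candidate generator in this spirit, using capitalized tokens as a crude stand-in for a real POS tagger (NLTK or spaCy would do this properly, and with far less noise):

```python
import re

def proper_noun_candidates(text):
    """Take capitalized tokens (skipping the very first token) and
    capitalized bigrams as entity candidates. Deliberately over-generates:
    the downstream classifier is expected to filter out the noise."""
    tokens = re.findall(r"\b[A-Za-z]+\b", text)
    cands = set()
    for i, tok in enumerate(tokens):
        if i > 0 and tok[0].isupper():
            cands.add(tok)
            if i + 1 < len(tokens) and tokens[i + 1][0].isupper():
                cands.add(f"{tok} {tokens[i + 1]}")
    return cands

print(proper_noun_candidates("my name is John Doe and I live in Rome"))
```

Note how "John Doe" survives as a bigram candidate, which the single-token dictionary lookup would have missed.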

NB: If your data comes from the Web (e.g. profiles from LinkedIn), try searching for the keywords 'Web data extraction' or take a look at import.io.

Literature

Just a few random picks; try searching on Google Scholar, preferably starting from surveys:

  1. Renuka S. Anami, Gauri R. Rao. Automated Profile Extraction and Classification with Stanford Algorithm. International Journal of Innovative Technology and Exploring Engineering (IJITEE) ISSN: 2278-3075, Volume-4 Issue-7, December 2014 (link)

  2. Swapnil Sonar. Resume Parsing with Named Entity Clustering Algorithm. 2015 (link)

Jamestown answered 7/9, 2015 at 21:49 Comment(5)
I am following your post. I would like to understand whether there is a way to allow for typos in my documents. For example, the cities dictionary has "Rome"; how can I also match "Romee"? You mentioned Jaro-Winkler, but calculating distances over a dictionary of 500,000 entities is too expensive, I think.Alibi
"The simplest such heuristic is to restrict the search to dictionary terms beginning with the same letter as the query string; the hope would be that spelling errors do not occur in the first character of the query." taken from 'Introduction to Information Retrieval', p. 60 (nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf); see chapter 3.3 for more sophisticated methodsJamestown
Last question: what is the best method for classification? You wrote about NER. Which is the better model, MaxEnt or Perceptron? I will accept your answer.Alibi
There is no single best classifier, but to start, as I wrote in the answer, you can try Logistic Regression or Random Forest (just don't forget to optimize hyperparameters like the number of trees in RF, especially in the case of Weka, which uses too small a number of trees by default)Jamestown
I wouldn't advise implementing/training NER yourself, at least at first. Try the ready-to-use Stanford NER instead; there are several models on their page (nlp.stanford.edu/software/CRF-NER.shtml), so choose the most suitable one or just take the default.Jamestown

Programming is really an art at this point. You have to find a way to make system users stick strictly to your fields, so you can read them as [entity value].

What pops into my mind with your idea is: how do programming tools identify errors in code and highlight what is causing them?

My 2 cents; hopefully this helps.

I'm really interested in these kinds of projects!

Ensconce answered 9/9, 2015 at 4:1 Comment(0)
