I would like to understand the best approach to the following problem.
I have documents that are very similar to resumes/CVs, and I have to extract entities from them (name, surname, birthday, city, zip code, etc.).
To extract those entities I am combining different finders (regex, dictionary, etc.).
The finders themselves work fine, but I am looking for a method or algorithm to confirm the entities they find.
By "confirm" I mean that I have to find specific terms (or other entities) in proximity to the entities I have found.
Example:
My name is <name>
Name: <name>
Name and Surname: <name>
In each of these cases I can confirm the entity <name> because it is close to a specific term that lets me understand the "context". If the words "name" or "surname" appear near the entity, I can say with good probability that I have found a <name>.
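To make the idea concrete, here is a rough sketch of such a proximity rule in Python. It is only an illustration: the function name `context_score`, the trigger list, the 30-character window, and the linear scoring are all assumptions, not something my finders already do.

```python
import re

# Hypothetical trigger terms that hint at a nearby <name> entity.
NAME_TRIGGERS = ("name", "surname")


def context_score(text, start, end, triggers=NAME_TRIGGERS, window=30):
    """Score in [0, 1] for the entity at text[start:end].

    The closer a trigger term is to the entity, the higher the score;
    0.0 means no trigger term was found within `window` characters.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in triggers) + r")\b",
        re.IGNORECASE,
    )
    best = 0.0
    for m in pattern.finditer(text):
        if m.end() <= start:        # trigger is before the entity
            dist = start - m.end()
        elif m.start() >= end:      # trigger is after the entity
            dist = m.start() - end
        else:                       # trigger overlaps the entity itself
            continue
        if dist <= window:
            best = max(best, 1.0 - dist / window)
    return best
```

The finder would supply `start` and `end` (the character offsets of the candidate), and the entity would be accepted only when the score is above some threshold that I would still have to tune.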
So the goal is to write this kind of rule to confirm entities. Another example would be:
My address is ......, 00143 Rome
Italian zip codes are 5 digits long (numeric only), so it is easy to find a 5-digit number inside my document (I use a regex, as I wrote above), and I also query a database to check whether the number exists as a zip code. The problem is that I need one more check to confirm it definitively.
I must check whether that number is near a <city> entity; if it is, I have a good probability that it really is a zip code.
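A minimal sketch of this second check, again in Python and again only illustrative. The dictionary `KNOWN_ZIPCODES` is a hypothetical stand-in for my database lookup (I am assuming here that it can return the city for a zip code), and `max_gap` is an arbitrary distance in characters.

```python
import re

# Hypothetical stand-in for the database: zip code -> city name.
KNOWN_ZIPCODES = {"00143": "Rome", "20121": "Milan"}

ZIP_RE = re.compile(r"\b\d{5}\b")


def confirmed_zipcodes(text, max_gap=15):
    """Return (zipcode, city) pairs where a known zip code has its
    city name within `max_gap` characters, before or after it."""
    results = []
    for z in ZIP_RE.finditer(text):
        city = KNOWN_ZIPCODES.get(z.group())
        if city is None:
            continue  # five digits, but not a known zip code
        city_re = re.compile(r"\b" + re.escape(city) + r"\b", re.IGNORECASE)
        for c in city_re.finditer(text):
            if c.start() >= z.end():
                gap = c.start() - z.end()   # city after the number
            else:
                gap = z.start() - c.end()   # city before the number
            if 0 <= gap <= max_gap:
                results.append((z.group(), city))
                break
    return results
```

With this rule a 5-digit order number that happens to exist as a zip code would be rejected unless the matching city appears next to it.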
I also tried to train a model, but I do not really have a "context" (sentences). Training the model with:
My name is: <name>John</name>
Name: <name>John</name>
Name/Surname: <name>John</name>
<name>John</name> is my name
does not sound good to me because:
- I have read that many sentences are needed to train a good model
- Those are not "sentences", so I do not have a "context" (remember that the documents are similar to resumes/CVs)
- Maybe those phrases are too short
I do not know how many different ways there are to say the same thing, but surely I cannot find 15000 of them :)
What method should I use to try to confirm my entities?
Thank you so much!