Can anyone point me to the algorithm(s) used by the openNLP NameFinder module? The code is complex and only sparsely documented, and playing with it as a black box (with the default model provided) gives me the impression that it is mostly heuristic. Here are some example inputs and outputs:
Input:
John Smith is frustrated.
john smith is frustrated.
Barak Obama is frustrated.
Hugo Chavez is frustrated. (no more)
Jeff Atwood is frustrated.
Bing Liu is frustrated with openNLP NER module.
Noam Chomsky is frustrated with the world.
Jayden Smith is frustrated.
Smith Jayden is frustrated.
Lady Gaga is frustrated.
Ms. Gaga is frustrated.
Mrs. Gaga is frustrated.
Jayden is frustrated.
Mr. Liu is frustrated.
Output (I replaced the angle brackets with square brackets):
[START:person] John Smith [END] is frustrated.
john smith is frustrated.
[START:person] Barak Obama [END] is frustrated.
Hugo Chavez is frustrated. (no more)
[START:person] Jeff Atwood [END] is frustrated.
Bing Liu is frustrated with openNLP NER module.
[START:person] Noam Chomsky [END] is frustrated with the world.
Jayden [START:person] Smith [END] is frustrated.
[START:person] Smith [END] [START:person] Jayden [END] is frustrated.
Lady Gaga is frustrated.
Ms. Gaga is frustrated.
Mrs. Gaga is frustrated.
Jayden is frustrated.
Mr. Liu is frustrated.
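To compare outputs like these mechanically (for example, against another NER system), the bracketed annotations can be pulled out with a small helper. This is only a minimal sketch that assumes the square-bracket form shown above; the class `TaggedSpans` and its `extract` method are hypothetical names of mine and are not part of openNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TaggedSpans {
    // Matches "[START:type] some text [END]" in the square-bracket form used above.
    // Group 1 is the entity type, group 2 is the annotated text (matched lazily so
    // that adjacent spans on one line are kept separate).
    private static final Pattern SPAN =
        Pattern.compile("\\[START:(\\w+)\\]\\s*(.*?)\\s*\\[END\\]");

    /** Returns each annotated entity as "type=text", in order of appearance. */
    public static List<String> extract(String taggedLine) {
        List<String> spans = new ArrayList<>();
        Matcher m = SPAN.matcher(taggedLine);
        while (m.find()) {
            spans.add(m.group(1) + "=" + m.group(2));
        }
        return spans;
    }

    public static void main(String[] args) {
        // Two adjacent single-token spans, as in the "Smith Jayden" output.
        System.out.println(extract(
            "[START:person] Smith [END] [START:person] Jayden [END] is frustrated."));
        // A line with no annotations yields an empty list.
        System.out.println(extract("Lady Gaga is frustrated."));
    }
}
```

An empty list for a line that contains a name is then a false negative for that line.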
It seems that the model simply learns a fixed list of names that are annotated in the training data and allows some tiling and combinations. Two notable false-negative (FN) examples are:
- Strong name indicators such as Mr. and Mrs. are ignored.
- Jayden (#4 most popular name in the US in 2011) wasn't identified, while the following 'Smith' (in "Jayden Smith...") was. I suspect that the model "thinks" that Jayden is capitalized because it begins the sentence and not because it is a named entity (NE). When I reverse the order to "Smith Jayden" as a hint (assuming 1), openNLP identifies it as two distinct NEs, unlike other full names such as "John Smith", which may suggest that 'Smith' is in the last-names list...
-> I'm puzzled and frustrated, and if anyone could point me to the algorithm (or verify it sucks) I'd be thankful.
P.S. Both the Stanford and UIUC NER systems perform much better, with some subtle differences that are interesting but off topic (this question is too long as is).