Accuracy: ANNIE vs Stanford NLP vs OpenNLP with UIMA

My work is planning on using a UIMA cluster to run documents through to extract named entities and the like. As I understand it, UIMA has very few NLP components packaged with it. I've been testing GATE for a while now and am fairly comfortable with it. It does OK on normal text, but when we run it over some representative test data, the accuracy drops way down. The text data we have internally is sometimes all caps, sometimes all lowercase, or a mix of the two in the same document. Even using ANNIE's all-caps rules, the accuracy still leaves much to be desired. I've recently heard of Stanford NLP and OpenNLP but haven't had time to extensively train and test them. How do those two compare in terms of accuracy with ANNIE? Do they work with UIMA like GATE does?

Thanks in advance.

Gillard answered 7/4, 2013 at 0:6 Comment(1)
ANNIE is rule-based. My guess is that Stanford NLP and OpenNLP should perform better since they are ML-based. – Undressed

It's not possible or reasonable to give a general estimate of the performance of these systems. As you said, accuracy declines on your test data. That happens for several reasons: one is the language characteristics of your documents, another is the characteristics of the annotations you are expecting to see. As far as I know, every NER task has similar but still different annotation guidelines.

That said, on to your questions:

ANNIE is the only free, open-source, rule-based NER system in Java I could find. It's written for news articles and, I guess, tuned for the MUC-6 task. It's good for proofs of concept, but getting a bit outdated. Its main advantage is that you can start improving it without any knowledge of machine learning or NLP (well, maybe a little Java). Just study JAPE and give it a shot.
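To give a taste of what those JAPE rules look like, here is a minimal sketch of a grammar phase; the annotation types follow ANNIE's conventions, but the rule itself and its feature values are illustrative only:

```
Phase: OrgMatcher
Input: Token Lookup
Options: control = appelt

// Match a gazetteer hit of majorType "organization", optionally
// followed by a company suffix, and annotate it as Organization.
Rule: OrgWithSuffix
(
  {Lookup.majorType == "organization"}
  ({Token.string == "Ltd"} | {Token.string == "Inc"})?
):org
-->
:org.Organization = {rule = "OrgWithSuffix"}
```

Rules like this are purely surface-based, which is also why ANNIE struggles with all-caps or all-lowercase text: a `{Token.orth == upperInitial}` constraint simply stops matching when the casing changes.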

OpenNLP, Stanford NLP, etc. ship by default with models trained on news articles and perform better than ANNIE (just judging from reported results; I never tested them on a big corpus). I liked the Stanford parser better than OpenNLP, again just looking at output on documents, mostly news articles.
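For a sense of how little glue code the ML-based tools need, here is a sketch of tagging entities with a pre-trained OpenNLP model. The model file name and path are assumptions (OpenNLP's pre-trained models are downloaded separately), and you need the OpenNLP jar on the classpath:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Arrays;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class NerSketch {
    public static void main(String[] args) throws Exception {
        // "en-ner-person.bin" is a separately downloaded pre-trained model;
        // the path here is an assumption for illustration.
        try (InputStream in = new FileInputStream("en-ner-person.bin")) {
            NameFinderME finder = new NameFinderME(new TokenNameFinderModel(in));
            String[] tokens = {"John", "Smith", "works", "for", "ACME", "."};
            // find() returns spans over the token array, not character offsets.
            for (Span s : finder.find(tokens)) {
                System.out.println(s.getType() + ": " + String.join(" ",
                        Arrays.copyOfRange(tokens, s.getStart(), s.getEnd())));
            }
            // Reset the document-level adaptive context between documents.
            finder.clearAdaptiveData();
        }
    }
}
```

The same statistical model can also be retrained on your own mixed-case data, which is usually the right answer when the pre-trained news models fall over.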

Without knowing what your documents look like, I really can't say much more. You should decide whether your data is suitable for rules, or whether to go the machine-learning way and use OpenNLP, the Stanford parser, the Illinois tagger, or another ML-based tool. The Stanford parser seems more appropriate for simply pouring in your data, training, and producing results, while OpenNLP seems more appropriate for trying different algorithms, playing with parameters, and so on.

As for your GATE vs. UIMA question, I tried both and found a more active community and better documentation for GATE. Sorry for giving personal opinions :)

Elston answered 17/4, 2013 at 14:27 Comment(0)

Just for the record answering the UIMA angle: For both Stanford NLP and OpenNLP, there is excellent packaging as UIMA analysis engines available via the DKPro Core project.
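As a sketch of what that looks like in practice: with DKPro Core, the OpenNLP components run as ordinary UIMA analysis engines wired up through the uimaFIT API. The input path is an assumption, and the exact component and parameter names should be checked against the DKPro Core release you use:

```java
import static org.apache.uima.fit.factory.AnalysisEngineFactory.createEngineDescription;
import static org.apache.uima.fit.factory.CollectionReaderFactory.createReaderDescription;

import org.apache.uima.fit.pipeline.SimplePipeline;

import de.tudarmstadt.ukp.dkpro.core.io.text.TextReader;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpNamedEntityRecognizer;
import de.tudarmstadt.ukp.dkpro.core.opennlp.OpenNlpSegmenter;

public class DkproNerPipeline {
    public static void main(String[] args) throws Exception {
        SimplePipeline.runPipeline(
            // Read plain-text documents; the glob is illustrative.
            createReaderDescription(TextReader.class,
                TextReader.PARAM_SOURCE_LOCATION, "docs/*.txt",
                TextReader.PARAM_LANGUAGE, "en"),
            // Tokenization/sentence splitting, then NER, both as
            // UIMA analysis engines wrapping OpenNLP.
            createEngineDescription(OpenNlpSegmenter.class),
            createEngineDescription(OpenNlpNamedEntityRecognizer.class));
    }
}
```

DKPro Core resolves the underlying models as Maven artifacts, so the pre-trained models are fetched automatically rather than managed by hand.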

Technicolor answered 22/12, 2014 at 15:36 Comment(0)

I would like to add one more note. UIMA and GATE are two frameworks for building Natural Language Processing (NLP) applications. Named Entity Recognition (NER), however, is a basic NLP component, and you can find NER implementations that are independent of both UIMA and GATE. The good news is that you can usually find a wrapper for a decent NER system in both UIMA and GATE; the same is true for the Stanford NER component.

Coming back to your question, this website lists state-of-the-art NER systems: http://www.aclweb.org/aclwiki/index.php?title=Named_Entity_Recognition_(State_of_the_art)

For example, in the MUC-7 competition, the best participant, named LTG, achieved an accuracy of 93.39%.

http://www.aclweb.org/aclwiki/index.php?title=MUC-7_(State_of_the_art)

Note that if you want to use such a state-of-the-art implementation, you may have issues with its license.

Forseti answered 5/6, 2015 at 19:16 Comment(0)
