How is Apache UIMA different from Apache OpenNLP?

I have been doing some capability testing with Apache OpenNLP, which can perform sentence detection, tokenization, and named entity recognition. When I started looking at the UIMA documentation, I noticed that the UIMA home page mentions "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)".

This suggests that I can use UIMA to do the same tasks that OpenNLP does. What does each one offer that the other does not? I am new to this area, so please help me understand the uses and capabilities of both.

Wilbur answered 19/5, 2015 at 6:46 Comment(1)
At a quick glance, it looks like UIMA uses OpenNLP to provide a framework for managing resources of many different kinds, including images, audio, and video, while the focus of OpenNLP is obviously text files containing human language. - Ginsburg

As I understand the question, you are asking for the differences between the feature sets of Apache UIMA and Apache OpenNLP. Their feature sets barely have anything in common as these two projects have very different aims.

Apache UIMA is an open source implementation of the UIMA specification. The latter defines a conceptual framework for augmenting unstructured information (such as natural language produced by humans) with structured metadata so that computers can work with it.

As an example of an application working with unstructured information, let us take an application that takes natural language text as input and marks all named entities in the given text, e.g.

Input text = "Bob's cat Charlie is chasing a mouse."
Result = "<NE>Bob</NE>'s cat <NE>Charlie</NE> is chasing a mouse."

To identify the named entities in this example (i.e. Bob and Charlie), several steps of natural language processing have to be performed. Without going into detail about what each of the steps does, a hypothetical system for named entity recognition might involve the following steps:

  1. Data preparation
  2. Sentence splitting
  3. Tokenization
  4. Token lemmatization
  5. Part-of-speech tagging
  6. Phrase detection
  7. Classifying phrases as named entities or not

As you can see, such applications can be modelled very intuitively as sequences of components, and this is exactly what UIMA does. It models applications dealing with unstructured information as pipelines of components (called analytics in UIMA parlance). Many of the pipeline components listed above can also be used for other tasks, so the architecture of UIMA emphasizes reusability of components.
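To make the pipeline idea concrete, here is a minimal sketch in plain Java. It is not the Apache UIMA API: the names `Annotation`, `Document`, `Component`, `LetterTokenizer`, `CapitalizedNameTagger`, and `PipelineSketch` are all made up for illustration, and the "named entity" rule (any capitalized token) is a deliberately naive stand-in for real NLP steps. The point is only the shape: components share one document object and each adds metadata that later components can read.

```java
import java.util.ArrayList;
import java.util.List;

// All types here are hypothetical and for illustration only; this is not the Apache UIMA API.
final class PipelineSketch {
    // Runs the components in order over one shared document.
    static Document run(String text, List<Component> components) {
        Document doc = new Document(text);
        for (Component c : components) {
            c.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        Document doc = run("Bob's cat Charlie is chasing a mouse.",
                List.<Component>of(new LetterTokenizer(), new CapitalizedNameTagger()));
        for (Annotation ne : doc.ofType("NE")) {
            System.out.println(doc.covered(ne)); // prints "Bob", then "Charlie"
        }
    }
}

// A typed span over the document text, e.g. a "Token" or a named entity ("NE").
final class Annotation {
    final String type;
    final int begin;
    final int end;

    Annotation(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

// The object handed from component to component: raw text plus accumulated metadata.
final class Document {
    final String text;
    final List<Annotation> annotations = new ArrayList<>();

    Document(String text) {
        this.text = text;
    }

    String covered(Annotation a) {
        return text.substring(a.begin, a.end);
    }

    // Returns a copy, so a component can add annotations while iterating over the result.
    List<Annotation> ofType(String type) {
        List<Annotation> result = new ArrayList<>();
        for (Annotation a : annotations) {
            if (a.type.equals(type)) {
                result.add(a);
            }
        }
        return result;
    }
}

// The contract every pipeline step implements.
interface Component {
    void process(Document doc);
}

// Naive tokenizer: every maximal run of letters becomes a "Token".
final class LetterTokenizer implements Component {
    @Override
    public void process(Document doc) {
        int start = -1;
        for (int i = 0; i <= doc.text.length(); i++) {
            boolean letter = i < doc.text.length() && Character.isLetter(doc.text.charAt(i));
            if (letter && start < 0) {
                start = i;
            } else if (!letter && start >= 0) {
                doc.annotations.add(new Annotation("Token", start, i));
                start = -1;
            }
        }
    }
}

// Toy classifier (an assumption, not a real NER method): capitalized tokens count as named entities.
final class CapitalizedNameTagger implements Component {
    @Override
    public void process(Document doc) {
        for (Annotation token : doc.ofType("Token")) {
            if (Character.isUpperCase(doc.text.charAt(token.begin))) {
                doc.annotations.add(new Annotation("NE", token.begin, token.end));
            }
        }
    }
}
```

Because the tagger depends only on "Token" annotations and not on the tokenizer class itself, either component could be swapped out or reused in another pipeline, which is the kind of reusability described above.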

To avoid confusion, the UIMA standard itself doesn't provide any specific components, but defines an infrastructure for UIM (Unstructured Information Management) applications, e.g. workflows, data types, inter-component communication, and so on.

Apache OpenNLP, on the other hand, does exactly that: it provides concrete implementations of NLP algorithms dealing with very specific tasks (sentence splitting, POS tagging, etc.). The source of your confusion might be that it is possible to write Apache UIMA components that wrap OpenNLP tools. The OpenNLP project actually provides such components.
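The wrapping idea can be sketched as a plain adapter. Nothing below is taken from OpenNLP or UIMA: `StandaloneTokenizer` is a hypothetical stand-in for a self-contained NLP tool, `PipelineStep` and `AnnotatedText` are a hypothetical stand-in for a framework's component contract and shared document, and `TokenizerAdapter` is the glue between them.

```java
import java.util.ArrayList;
import java.util.List;

// All names are hypothetical; neither OpenNLP nor UIMA classes are used here.
final class AdapterSketch {
    public static void main(String[] args) {
        AnnotatedText doc = new AnnotatedText("Charlie is chasing a mouse.");
        new TokenizerAdapter(new StandaloneTokenizer()).process(doc);
        System.out.println(doc.tokens);
    }
}

// Stand-in for a library tool: knows nothing about pipelines, only about text.
final class StandaloneTokenizer {
    // Returns {begin, end} offsets of whitespace-separated tokens.
    List<int[]> tokenSpans(String text) {
        List<int[]> spans = new ArrayList<>();
        int start = -1;
        for (int i = 0; i <= text.length(); i++) {
            boolean inToken = i < text.length() && !Character.isWhitespace(text.charAt(i));
            if (inToken && start < 0) {
                start = i;
            } else if (!inToken && start >= 0) {
                spans.add(new int[] {start, i});
                start = -1;
            }
        }
        return spans;
    }
}

// Stand-in for the framework side: a shared document and a component contract.
final class AnnotatedText {
    final String text;
    final List<String> tokens = new ArrayList<>();

    AnnotatedText(String text) {
        this.text = text;
    }
}

interface PipelineStep {
    void process(AnnotatedText doc);
}

// The wrapper: exposes the library tool through the framework's contract.
final class TokenizerAdapter implements PipelineStep {
    private final StandaloneTokenizer tool;

    TokenizerAdapter(StandaloneTokenizer tool) {
        this.tool = tool;
    }

    @Override
    public void process(AnnotatedText doc) {
        for (int[] span : tool.tokenSpans(doc.text)) {
            doc.tokens.add(doc.text.substring(span[0], span[1]));
        }
    }
}
```

The tool stays usable on its own, while the adapter is the only piece that knows about the pipeline. That is roughly the division of labor between a concrete NLP library and a framework that orchestrates components.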

Whether you want to use the UIMA framework for your UIM applications depends on the size of the project. If it is small, I would go without UIMA and just use OpenNLP directly, as UIMA is rather heavyweight and, for small applications, adds complexity and overhead that you do not need. Also, due to its complexity, it takes a good amount of time to learn how to use it.

Summing up, Apache UIMA and Apache OpenNLP solve different problems, but since both deal with unstructured information, they can be combined profitably.

Percentile answered 19/5, 2015 at 14:32 Comment(1)
Your answer is to the point. It clarified my understanding a great deal. What are some other tools in each of these spaces, i.e. (1) for building NLP pipelines and (2) for providing concrete implementations of NLP algorithms, other than UIMA and OpenNLP? Thanks. - Infare
