Entity Extraction/Recognition with free tools while feeding Lucene Index
I'm currently investigating options to extract person names, locations, tech terms and categories from text (many articles from the web), which will then be fed into a Lucene/ElasticSearch index. The additional information is added as metadata and should increase the precision of the search.

E.g. when someone queries 'wicket', they should be able to decide whether they mean the cricket term or the Apache project. I tried to implement this on my own with minor success so far. I have since found a lot of tools, but I'm not sure whether they are suited to this task, which of them integrate well with Lucene, or whether their entity extraction precision is high enough.
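To make the goal concrete, here is a minimal sketch of how extracted entities could be attached as metadata fields, using the Lucene 4+ field API (newer than the Lucene version current when this was written); `buildDoc` and its arguments are placeholders of mine, not an existing API:

```java
import java.util.Arrays;
import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class EntityMetadataSketch {
    // Attach the article body as analyzed text and each extracted entity as an
    // exact-match keyword field, so queries can filter on entity/category.
    static Document buildDoc(String articleText, List<String> entities, String category) {
        Document doc = new Document();
        doc.add(new TextField("body", articleText, Field.Store.YES));
        for (String entity : entities) {
            doc.add(new StringField("entity", entity, Field.Store.YES));
        }
        doc.add(new StringField("category", category, Field.Store.YES)); // e.g. "software" vs "sport"
        return doc;
    }

    public static void main(String[] args) {
        Document doc = buildDoc("Welcome to Apache Wicket ...",
                Arrays.asList("Apache Wicket", "Java"), "software");
        System.out.println(doc);
    }
}
```

A query for 'wicket' could then be combined with a filter on category:software or category:sport to resolve the ambiguity.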

My questions:

  • Does anyone have experience with some of these tools and their precision/recall? And is training data required and, if so, available?
  • Are there articles or tutorials where I can get started with entity extraction (NER) for each tool?
  • How can they be integrated with Lucene?


Brooklet answered 17/9, 2011 at 13:42 Comment(1)
wether = a castrated ram; you meant whether – Tawny
The problem you are facing in the 'wicket' example is called entity disambiguation, not entity extraction/recognition (NER). NER can be useful, but only when the categories are specific enough. Most NER systems don't have enough granularity to distinguish between a sport and a software project (both would fall outside the typically recognized types: person, organization, location).

For disambiguation, you need a knowledge base against which entities are disambiguated. DBpedia is a typical choice due to its broad coverage. See my answer to How to use DBPedia to extract Tags/Keywords from content?, where I provide more explanation and mention several tools for disambiguation.

These tools typically offer a language-independent API such as REST, and I do not know of any that provide Lucene support directly, but I hope my answer has been helpful for the problem you are trying to solve.
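As an illustration of that REST style, here is a minimal sketch calling DBpedia Spotlight's public annotate endpoint; the endpoint URL and the confidence parameter are assumptions based on the service as it exists today (it postdates this answer), and a self-hosted server is the usual production setup:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class SpotlightSketch {
    public static void main(String[] args) throws Exception {
        // Public demo endpoint; assumed URL and parameters, not guaranteed stable
        String text = "Wicket is a component-based web framework from Apache.";
        String url = "https://api.dbpedia-spotlight.org/en/annotate?text="
                + URLEncoder.encode(text, "UTF-8") + "&confidence=0.5";

        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // JSON response links surface forms to DBpedia URIs,
                // e.g. dbpedia.org/resource/Apache_Wicket
                System.out.println(line);
            }
        }
    }
}
```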

Efferent answered 19/9, 2011 at 15:26 Comment(4)
Thanks for your answer and for the pointers! When using NER I hoped to solve entity disambiguation too, because tagging an article with apache wicket, java, programming language and so on would somehow lead to an entity disambiguation solution when mapping them to their categories (e.g. software systems) ... I have to think about it some more – Brooklet
On the Maui indexer blog (really nice! maui-indexer.blogspot.com) I found a nice tool: wikipedia-miner.cms.waikato.ac.nz/demos/search/?query=wicket – Brooklet
NER will generally not help because, as I explained, very few if any NER systems provide distinctions fine-grained enough to identify software and sports, much less distinguish between the two. Extractiv is an exception. – Efferent
Yes, Wikipedia Miner is a great tool which I forgot about. It was developed by a researcher at the same university as the author of Maui. – Efferent
You can use OpenNLP to extract names of people, places and organisations without training. You just use pre-existing models, which can be downloaded from here: http://opennlp.sourceforge.net/models-1.5/

For an example of how to use one of these models, see: http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind
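For quick orientation, here is a minimal sketch along the lines of that manual, assuming en-ner-person.bin has been downloaded from the models page above and the input text is already tokenized:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class OpenNlpNameFinder {
    public static void main(String[] args) throws Exception {
        // Load the pre-trained person-name model downloaded from the models page above
        InputStream modelIn = new FileInputStream("en-ner-person.bin");
        TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
        modelIn.close();

        NameFinderME finder = new NameFinderME(model);

        // The name finder expects pre-tokenized sentences (e.g. from OpenNLP's TokenizerME)
        String[] tokens = {"Pierre", "Vinken", "is", "61", "years", "old", "."};
        Span[] spans = finder.find(tokens);
        String[] names = Span.spansToStrings(spans, tokens);
        for (int i = 0; i < spans.length; i++) {
            System.out.println(spans[i].getType() + ": " + names[i]); // e.g. "person: Pierre Vinken"
        }
        finder.clearAdaptiveData(); // reset between documents, per the manual
    }
}
```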

Bicarbonate answered 8/7, 2014 at 13:16 Comment(0)
Rosoka is a commercial product that provides a computation of "salience", which measures the importance of a term or entity to the document. Salience is based on linguistic usage, not frequency. Using the salience values, you can determine the primary topic of the document as a whole. The output is in your choice of XML or JSON, which makes it very easy to use with Lucene. It is written in Java. There is an Amazon Cloud version available at https://aws.amazon.com/marketplace/pp/B00E6FGJZ0; the cost to try it out is $0.99/hour. The Rosoka Cloud version does not have all of the Java API features that the full Rosoka does. Both versions perform entity and term disambiguation based on linguistic usage.

Disambiguation, whether by a human or by software, requires enough contextual information to determine the difference. The context may be contained within the document, within a corpus constraint, or within the context of the users; the former is more specific, the latter carries greater potential ambiguity. For example, typing the keyword "wicket" into a Google search could refer to cricket, the Apache software project or the Star Wars Ewok character (i.e. an entity). The sentence "The wicket is guarded by the batsman" has contextual clues within the sentence to interpret it as an object. "Wicket Wystri Warrick was a male Ewok scout" should interpret "Wicket" as the given name of the person entity "Wicket Wystri Warrick". "Welcome to Apache Wicket" has the contextual clues that "Wicket" is part of a product name, etc.

Tierney answered 28/10, 2013 at 13:42 Comment(0)
Lately I have been fiddling with Stanford CRF NER. They have released quite a few versions: http://nlp.stanford.edu/software/CRF-NER.shtml

The good thing is that you can train your own classifier. You should follow this link, which has guidelines on how to train your own NER: http://nlp.stanford.edu/software/crf-faq.shtml#a Unfortunately, in my case, the named entities were not extracted efficiently from the document; most of the entities went undetected.
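For reference, here is a minimal sketch of tagging text with one of the pre-trained 3-class models (PERSON, LOCATION, ORGANIZATION); the model path is an assumption based on the default layout of the Stanford NER download:

```java
import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import java.util.List;

public class StanfordNerSketch {
    public static void main(String[] args) throws Exception {
        // Pre-trained 3-class model shipped with the Stanford NER distribution
        CRFClassifier<CoreLabel> classifier = CRFClassifier.getClassifier(
                "classifiers/english.all.3class.distsim.crf.ser.gz");

        String text = "Jim bought 300 shares of Acme Corp. in 2006.";
        for (List<CoreLabel> sentence : classifier.classify(text)) {
            for (CoreLabel token : sentence) {
                String label = token.get(CoreAnnotations.AnswerAnnotation.class);
                if (!"O".equals(label)) { // skip tokens outside any entity
                    System.out.println(token.word() + " -> " + label);
                }
            }
        }
    }
}
```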

Just in case you find it useful.

Izaak answered 26/9, 2015 at 6:25 Comment(0)
