I'm looking for a Java library that can do named entity recognition (NER) against a custom controlled vocabulary, without needing labeled training data first. I searched on SE, but most existing questions are rather unspecific.
Consider the following use-case:
- an editor is inputting articles in a CMS (about 500 words).
- the text may contain references (in plain text) to entities of a specific domain, e.g.:
- names of points of interest, like bars, restaurants, as well as neighborhoods, etc.
- a controlled vocabulary of these entities exists (about 5,000 entities).
- I imagine an entity to be a -tuple in the vocabulary
- after finishing the text, the user should be able to save the document.
- This triggers a workflow that scans the text against the vocabulary by comparing it with the name of each entity. A 100% match is not required: a 97% Jaro-Winkler similarity or the like may be enough (I'm not familiar with which algorithms NER typically uses). I need this threshold to be configurable.
- Hits are returned to the controller server-side. The controller in turn returns JSON to the client containing the matched entities, which are presented to the editor as suggested crosslinks.
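To make the workflow above concrete, here is a rough, dependency-free sketch of the kind of matching I have in mind. Everything in it is an assumption for illustration: the names `CrosslinkSuggester`, `suggest` and `Hit` are made up, the vocabulary is reduced to a plain list of entity names, and the naive sliding-window scan with a hand-written Jaro-Winkler score is just one possible approach, not how any particular NER library works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: fuzzy dictionary lookup with a configurable threshold.
class CrosslinkSuggester {

    /** One suggested crosslink: vocabulary name, matched text, similarity score. */
    static final class Hit {
        final String entityName;
        final String matchedText;
        final double score;
        Hit(String entityName, String matchedText, double score) {
            this.entityName = entityName;
            this.matchedText = matchedText;
            this.score = score;
        }
    }

    /** Lower-cases the input and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    /** Plain Jaro similarity in [0, 1]. */
    static double jaro(String a, String b) {
        if (a.equals(b)) return 1.0;
        int la = a.length(), lb = b.length();
        if (la == 0 || lb == 0) return 0.0;
        int window = Math.max(0, Math.max(la, lb) / 2 - 1);
        boolean[] ma = new boolean[la], mb = new boolean[lb];
        int matches = 0;
        for (int i = 0; i < la; i++) {
            int lo = Math.max(0, i - window), hi = Math.min(lb - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (!mb[j] && a.charAt(i) == b.charAt(j)) {
                    ma[i] = true; mb[j] = true; matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;
        // Count matched characters that appear in a different order.
        int outOfOrder = 0, k = 0;
        for (int i = 0; i < la; i++) {
            if (!ma[i]) continue;
            while (!mb[k]) k++;
            if (a.charAt(i) != b.charAt(k)) outOfOrder++;
            k++;
        }
        double m = matches;
        return (m / la + m / lb + (m - outOfOrder / 2.0) / m) / 3.0;
    }

    /** Jaro-Winkler: Jaro plus a bonus for a shared prefix of up to 4 characters. */
    static double jaroWinkler(String a, String b) {
        double j = jaro(a, b);
        int max = Math.min(4, Math.min(a.length(), b.length()));
        int prefix = 0;
        while (prefix < max && a.charAt(prefix) == b.charAt(prefix)) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    /** Slides an n-word window over the text for every n-word vocabulary name. */
    static List<Hit> suggest(String text, List<String> vocabulary, double threshold) {
        List<String> words = tokenize(text);
        List<Hit> hits = new ArrayList<>();
        for (String name : vocabulary) {
            List<String> nameTokens = tokenize(name);
            int n = nameTokens.size();
            if (n == 0) continue;
            String target = String.join(" ", nameTokens);
            for (int i = 0; i + n <= words.size(); i++) {
                String candidate = String.join(" ", words.subList(i, i + n));
                double score = jaroWinkler(candidate, target);
                if (score >= threshold) hits.add(new Hit(name, candidate, score));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("Blue Note", "Greenwich Village", "Central Park");
        String article = "We had dinner at the Blu Note in Greenwich Village last night.";
        for (Hit h : suggest(article, vocabulary, 0.95)) {
            System.out.println(h.entityName + " <- \"" + h.matchedText + "\" (" + h.score + ")");
        }
    }
}
```

With roughly 5,000 names and 500 words per article this brute-force scan is only a few million string comparisons per save, so it may well be fast enough, but I would still much rather build on an existing, tested library than maintain something like this myself.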
Ideally, I'm looking for a project that uses NER to suggest crosslinks within a CMS environment, so that I can piggyback on it. (I'm sure plugins for WordPress exist, for example, but I'm not sure whether something similar exists in Java.)
More general pointers to NER libraries that work with custom controlled vocabularies are welcome as well.