I'm using CoreNLP for named entity extraction and have run into an issue: whenever a named entity consists of more than one token, such as "Han Solo", the annotator does not return "Han Solo" as a single named entity, but as two separate entities, "Han" and "Solo".
Is it possible to get the named entity as one token? I know I can use the CRFClassifier with classifyWithInlineXML for this, but my solution requires CoreNLP, since I also need to know the word number.
The following is the code that I have so far:
// Configure the pipeline; the NER annotator uses the CoNLL 4-class model.
Properties props = new Properties();
props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse");
props.setProperty("ner.model", "edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// Annotate the input text.
Annotation document = new Annotation(text);
pipeline.annotate(document);

// Print the NER tag of every token, one per line.
List<CoreMap> sentences = document.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        System.out.println(token.get(NamedEntityTagAnnotation.class));
    }
}
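One possible workaround is to merge consecutive tokens that carry the same non-"O" tag by hand, keeping the position of the first token. Below is a minimal sketch of that idea, not a CoreNLP feature. The class `EntityMerger`, its `Entity` type, and the `merge` method are hypothetical names, and the words and tags in `main` are hand-written sample data rather than real annotator output. It runs without CoreNLP so the merging logic can be checked on its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: groups adjacent tokens that share the same NER tag.
public class EntityMerger {

    /** A merged entity: its text, its tag, and the 1-based index of its first token. */
    static final class Entity {
        final String text;
        final String tag;
        final int startIndex;

        Entity(String text, String tag, int startIndex) {
            this.text = text;
            this.tag = tag;
            this.startIndex = startIndex;
        }

        @Override
        public String toString() {
            return text + "/" + tag + "@" + startIndex;
        }
    }

    /** Merges runs of identically tagged tokens; tokens tagged "O" are skipped. */
    static List<Entity> merge(List<String> words, List<String> tags) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("words and tags must have the same length");
        }
        List<Entity> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String currentTag = "O";
        int start = -1;

        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            if (!tag.equals("O") && tag.equals(currentTag)) {
                // Same entity continues: append this word to the open run.
                current.append(' ').append(words.get(i));
                continue;
            }
            // The tag changed: close the open run, if there is one.
            if (!currentTag.equals("O")) {
                entities.add(new Entity(current.toString(), currentTag, start));
            }
            current.setLength(0);
            currentTag = tag;
            if (!tag.equals("O")) {
                // Start a new run at this token (1-based position).
                current.append(words.get(i));
                start = i + 1;
            }
        }
        // Close a run that reaches the end of the input.
        if (!currentTag.equals("O")) {
            entities.add(new Entity(current.toString(), currentTag, start));
        }
        return entities;
    }

    public static void main(String[] args) {
        // Hand-written sample data, assumed for illustration only.
        List<String> words = Arrays.asList("Han", "Solo", "flew", "to", "Tatooine");
        List<String> tags = Arrays.asList("PERSON", "PERSON", "O", "O", "LOCATION");
        System.out.println(merge(words, tags)); // prints [Han Solo/PERSON@1, Tatooine/LOCATION@5]
    }
}
```

In the loop above, the two lists could presumably be filled from each token's text and its NamedEntityTagAnnotation value, one sentence at a time. Because the tags have no begin/inside markers, two distinct entities of the same type that sit directly next to each other would be merged into one, so this is only an approximation.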
Help me Obi-Wan Kenobi. You're my only hope.