I had the same problem, so I looked it up, too. The method proposed by Christopher Manning is efficient, but the delicate point is to know how to decide which kind of separator is appropriate. One could say only a space should be allowed, e.g. "John Zorn" >> one entity. However, I may find the form "J.Zorn", so I should also allow certain punctuation marks. But what about "Jack, James and Joe" ? I might get 2 entities instead of 3 ("Jack James" and "Joe").
By digging a bit in the Stanford NER classes, I actually found a proper implementation of this idea. They use it to export entities under the form of single String
objects. For instance, in the method PlainTextDocumentReaderAndWriter.printAnswersTokenizedInlineXML
, we have:
private void printAnswersInlineXML(List<IN> doc, PrintWriter out) {
final String background = flags.backgroundSymbol;
String prevTag = background;
for (Iterator<IN> wordIter = doc.iterator(); wordIter.hasNext();) {
IN wi = wordIter.next();
String tag = StringUtils.getNotNullString(wi.get(AnswerAnnotation.class));
String before = StringUtils.getNotNullString(wi.get(BeforeAnnotation.class));
String current = StringUtils.getNotNullString(wi.get(CoreAnnotations.OriginalTextAnnotation.class));
if (!tag.equals(prevTag)) {
if (!prevTag.equals(background) && !tag.equals(background)) {
out.print("</");
out.print(prevTag);
out.print('>');
out.print(before);
out.print('<');
out.print(tag);
out.print('>');
} else if (!prevTag.equals(background)) {
out.print("</");
out.print(prevTag);
out.print('>');
out.print(before);
} else if (!tag.equals(background)) {
out.print(before);
out.print('<');
out.print(tag);
out.print('>');
}
} else {
out.print(before);
}
out.print(current);
String afterWS = StringUtils.getNotNullString(wi.get(AfterAnnotation.class));
if (!tag.equals(background) && !wordIter.hasNext()) {
out.print("</");
out.print(tag);
out.print('>');
prevTag = background;
} else {
prevTag = tag;
}
out.print(afterWS);
}
}
They iterate over each word, checking if it has the same class (answer) than the previous, as explained before. For this, they take advantage of the fact expressions considered as not being entities are flagged using the so-called backgroundSymbol
(class "O"). They also use the property BeforeAnnotation
, which represents the string separating the current word from the previous one. This last point allows solving the problem I initially raised, regarding the choice of an appropriate separator.
IOB
models inCRFClassifier
? – Olivares