How this can be done is currently (2013 releases) a bit of a mess, since there are two different sets of flags for two different DocumentReaderAndWriter implementations. Sorry.
The most flexible support for different IOB styles is found in CoNLLDocumentReaderAndWriter. While it is reading files, you can have it map any IOB/IOE/... annotation written with hyphenated prefixes, like your examples (B-BRAND), to any other such scheme with the flag:
-entitySubclassification IOB2
The resulting label set is then used for training and classification. The options are documented in the entitySubclassify() method of CoNLLDocumentReaderAndWriter: IOB1, IOB2, IOE1, IOE2, SBIEO, IO. You can find a discussion of IOB1 vs. IOB2 in Tjong Kim Sang and Veenstra (1999). By default the representation is mapped back to IOB1 on output, since that is the default used in the CoNLL conlleval program, but you can keep it as what you mapped it to with the flag:
-retainEntitySubclassification
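To make the IOB1 vs. IOB2 difference concrete, here is a minimal Python sketch of the kind of relabeling involved. It is not the code in entitySubclassify(); the function name iob1_to_iob2 is made up for illustration, and it assumes every non-O label already carries a hyphenated prefix.

```python
def iob1_to_iob2(labels, outside="O"):
    """Convert one sentence's IOB1 labels to IOB2 (illustrative sketch only)."""
    converted = []
    prev_type = None  # entity type of the previous token, None if it was outside
    for label in labels:
        if label == outside:
            converted.append(outside)
            prev_type = None
            continue
        prefix, _, entity_type = label.partition("-")
        # IOB2 marks the first token of every entity with B-.
        # In IOB1 an entity may start with I-, so an I- label opens a new
        # entity whenever the previous token was outside or of another type.
        if prefix == "B" or entity_type != prev_type:
            converted.append("B-" + entity_type)
        else:
            converted.append("I-" + entity_type)
        prev_type = entity_type
    return converted


print(iob1_to_iob2(["I-BRAND", "I-BRAND", "O", "I-BRAND", "B-BRAND"]))
```

In IOB1 the B- prefix appears only where two entities of the same type touch; in IOB2 it appears on every entity start, which is why the two schemes give the model different label distributions to learn.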
To use this DocumentReaderAndWriter, you can give a training command like:
java8 -mx6g edu.stanford.nlp.ie.crf.CRFClassifier -prop conll.crf.chris2009.prop -readerAndWriter edu.stanford.nlp.sequences.CoNLLDocumentReaderAndWriter -entitySubclassification iob2
Alternatively, ColumnDocumentReaderAndWriter is the default DocumentReaderAndWriter, which we use in the distributed models. The options you get with it are different and slightly more limited. You have these two flags:
-mergeTags will take either plain ("BRAND") or CoNLL-like ("I-BRAND") labels, map them down to a prefix-less IO label ("BRAND"), and use that for training and classifying.
-iobTags can take either plain ("BRAND") or CoNLL-like ("I-BRAND") labels and map them to IOB2.
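As a rough illustration of what mapping down to prefix-less IO labels means, here is a small Python sketch. It is an assumption about the effect described above, not the ColumnDocumentReaderAndWriter code, and merge_tags is a hypothetical helper name.

```python
def merge_tags(labels):
    """Strip B-/I- prefixes so each token carries only its entity type (sketch)."""
    merged = []
    for label in labels:
        prefix, sep, rest = label.partition("-")
        if sep and prefix in {"B", "I"}:
            merged.append(rest)   # "B-BRAND" / "I-BRAND" -> "BRAND"
        else:
            merged.append(label)  # plain labels and "O" pass through unchanged
    return merged


print(merge_tags(["B-BRAND", "I-BRAND", "O", "BRAND"]))
```

The cost of this encoding is that two adjacent entities of the same type can no longer be told apart, since nothing marks where the second one begins.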
In a sequence model, for any of the labeling schemes like IOB2, the labels are simply different classes. That is how these labeling schemes work. The special interpretation of "I-", "B-", etc. is left to the human observer and entity-level evaluation software. The included evaluation software will work with IOB1, IOB2, or prefix-less IO encoding only.
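To show where that interpretation actually happens, here is a minimal Python sketch of how entity-level software might read entity spans back out of an IOB2 label sequence. It is not the included evaluation code; iob2_entities is a hypothetical name, and treating a stray I- label as the start of a new entity is a choice made for this sketch.

```python
def iob2_entities(labels, outside="O"):
    """Return (entity_type, start, end) spans from IOB2 labels; end is exclusive."""
    entities = []
    start = None    # index where the open entity began
    current = None  # type of the open entity, None if no entity is open
    for i, label in enumerate(labels):
        prefix, _, entity_type = label.partition("-")
        continues = prefix == "I" and entity_type == current
        if not continues:
            # Close whatever entity was open before this token.
            if current is not None:
                entities.append((current, start, i))
                start, current = None, None
            # Any non-outside label here opens a new entity.
            if label != outside:
                start, current = i, entity_type
    if current is not None:
        entities.append((current, start, len(labels)))
    return entities


print(iob2_entities(["B-BRAND", "I-BRAND", "O", "B-BRAND", "B-PER", "I-PER"]))
```

The classifier itself never runs anything like this; to it, "B-BRAND" and "I-BRAND" are two unrelated classes, and only a post-processing step of this kind turns them into entities.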