CoreNLP has a CoNLL output format option, but unfortunately it doesn't match what MaltParser expects. (Confusingly, there are several common CoNLL data formats, which differ between competition years.)
If you run CoreNLP from the command line with the option -outputFormat conll, you'll get output in the following TSV format (example output at end of answer):
INDEX WORD LEMMA POS NER DEPHEAD DEPREL
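To make the column layout concrete, here is a minimal Python sketch that reads this seven-column output into per-token dictionaries. The function name parse_conll and the lowercase field names are my own, not part of CoreNLP, and the sketch assumes tab-separated fields with a blank line between sentences.

```python
# Field names follow the column order listed above; the names are illustrative.
COLUMNS = ["index", "word", "lemma", "pos", "ner", "dephead", "deprel"]


def parse_conll(text):
    """Split CoNLL-style text into sentences, each a list of per-token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line is assumed to end the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(
                f"expected {len(COLUMNS)} columns, got {len(fields)}: {line!r}"
            )
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


sample = "1\tThis\tthis\tDT\t_\t_\t_\n2\tis\tbe\tVBZ\t_\t_\t_\n"
tokens = parse_conll(sample)[0]
print(tokens[1]["lemma"])  # prints: be
```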
MaltParser expects a slightly different format, but you can customize its data input / output format. Try putting this content in maltparser/appdata/dataformat/myconll.xml:
<?xml version="1.0" encoding="UTF-8"?>
<dataformat name="myconll" reader="tab" writer="tab">
<column name="ID" category="INPUT" type="INTEGER"/>
<column name="FORM" category="INPUT" type="STRING"/>
<column name="LEMMA" category="INPUT" type="STRING"/>
<column name="POSTAG" category="INPUT" type="STRING"/>
<column name="NER" category="IGNORE" type="STRING"/>
<column name="HEAD" category="HEAD" type="INTEGER"/>
<column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
</dataformat>
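Since the mapping is purely positional, it is worth checking that the column definitions line up one-to-one with the CoreNLP columns. The following Python sketch does that with the standard library's ElementTree. It inlines the format definition from above and only checks order and count; it says nothing about whether MaltParser itself accepts the file.

```python
import xml.etree.ElementTree as ET

# The format definition from this answer, inlined so the sketch is self-contained.
DATAFORMAT_XML = """<?xml version="1.0" encoding="UTF-8"?>
<dataformat name="myconll" reader="tab" writer="tab">
<column name="ID" category="INPUT" type="INTEGER"/>
<column name="FORM" category="INPUT" type="STRING"/>
<column name="LEMMA" category="INPUT" type="STRING"/>
<column name="POSTAG" category="INPUT" type="STRING"/>
<column name="NER" category="IGNORE" type="STRING"/>
<column name="HEAD" category="HEAD" type="INTEGER"/>
<column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
</dataformat>"""

# Column order of the CoreNLP CoNLL output, as listed earlier in the answer.
CORENLP_COLUMNS = ["INDEX", "WORD", "LEMMA", "POS", "NER", "DEPHEAD", "DEPREL"]

# Parse from bytes so the encoding declaration is handled by the XML parser.
root = ET.fromstring(DATAFORMAT_XML.encode("utf-8"))
columns = [(col.get("name"), col.get("category")) for col in root.findall("column")]

if len(columns) != len(CORENLP_COLUMNS):
    raise SystemExit("column count does not match the CoreNLP output")

# Print the positional pairing so a mismatch is easy to spot by eye.
for position, (corenlp_name, column) in enumerate(zip(CORENLP_COLUMNS, columns), start=1):
    malt_name, category = column
    print(f"{position}\t{corenlp_name}\t{malt_name}\t{category}")
```

The fifth row should pair NER with the IGNORE category, which is what lets MaltParser skip the extra CoreNLP column.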
Then add this to your MaltParser config file (you can find an example config in maltparser/examples/optionexample.xml):
<?xml version="1.0" encoding="UTF-8"?>
<experiment>
<optioncontainer>
...
<optiongroup groupname="input">
<option name="format" value="myconll"/>
</optiongroup>
</optioncontainer>
...
</experiment>
Then you should be able to provide CoreNLP CoNLL output as training data to MaltParser.
Untested, but if the MaltParser docs are honest, this should work.
Example CoreNLP CoNLL output (I only used the annotators tokenize,ssplit,pos, so the NER, DEPHEAD and DEPREL columns are empty):
$ echo "This is a test." | java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -outputFormat conll 2>/dev/null
1 This this DT _ _ _
2 is be VBZ _ _ _
3 a a DT _ _ _
4 test test NN _ _ _
5 . . . _ _ _
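Note that rows like the ones above, with _ in the head and relation columns, carry no dependency information, so they could not serve as training data as-is. A rough Python sketch for catching that before handing a file to MaltParser follows. The helper name untrainable_rows is hypothetical, and it assumes tab-separated rows with the head in column 6 and the relation in column 7.

```python
def untrainable_rows(text):
    """Return (line_number, line) pairs that lack a usable HEAD or DEPREL."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines only separate sentences
        fields = line.split("\t")
        if len(fields) != 7:
            bad.append((number, line))
            continue
        head, deprel = fields[5], fields[6]
        # HEAD must be a non-negative integer; DEPREL must not be the placeholder.
        if not head.isdigit() or deprel == "_":
            bad.append((number, line))
    return bad


# The example output from this answer, rebuilt with explicit tabs.
rows = [
    ["1", "This", "this", "DT", "_", "_", "_"],
    ["2", "is", "be", "VBZ", "_", "_", "_"],
    ["3", "a", "a", "DT", "_", "_", "_"],
    ["4", "test", "test", "NN", "_", "_", "_"],
    ["5", ".", ".", ".", "_", "_", "_"],
]
sample = "\n".join("\t".join(row) for row in rows)
print(len(untrainable_rows(sample)))  # prints: 5
```

If every row is flagged, as here, the pipeline presumably ran without a dependency annotator and needs to be rerun with one before the output is useful for training.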