How to convert a text file to CoNLL format for MaltParser?

There is a CoNLL formatting option for CoreNLP output, but unfortunately it doesn't match what MaltParser expects. (Confusingly, there are several different common CoNLL data formats, one for each of the different competition years.)

If you run CoreNLP from the command line with the option -outputFormat conll, you'll get output in the following TSV format (example output at end of answer):

INDEX    WORD    LEMMA    POS    NER    DEPHEAD    DEPREL
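
To make the column layout concrete, here is a minimal Python sketch that reads this seven-column output into per-sentence records. It assumes the fields are tab-separated, that a blank line ends a sentence, and that `_` marks an empty value; the names `Token` and `parse_corenlp_conll` are made up for illustration and are not part of CoreNLP.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    """One row of the seven-column CoreNLP CoNLL output (hypothetical helper type)."""
    index: int
    word: str
    lemma: str
    pos: str
    ner: str
    head: Optional[int]  # None when the column holds "_"
    deprel: str


def parse_corenlp_conll(text: str) -> List[List[Token]]:
    """Split CoNLL text into sentences, each a list of Token rows."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # Assumption: a blank line separates sentences.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(
                f"expected 7 tab-separated fields, got {len(fields)}: {line!r}"
            )
        idx, word, lemma, pos, ner, head, deprel = fields
        current.append(
            Token(int(idx), word, lemma, pos, ner,
                  None if head == "_" else int(head), deprel)
        )
    if current:
        sentences.append(current)
    return sentences
```

Each returned sentence is a plain list, so `parse_corenlp_conll(text)[0][1].lemma` gives the lemma of the second token of the first sentence.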

MaltParser expects a slightly different format, but you can customize its input/output data format. Try putting this content in maltparser/appdata/dataformat/myconll.xml:

<?xml version="1.0" encoding="UTF-8"?>
<dataformat name="myconll" reader="tab" writer="tab">
    <column name="ID" category="INPUT" type="INTEGER"/>
    <column name="FORM" category="INPUT" type="STRING"/>
    <column name="LEMMA" category="INPUT" type="STRING"/>
    <column name="POSTAG" category="INPUT" type="STRING"/>
    <column name="NER" category="IGNORE" type="STRING"/>
    <column name="HEAD" category="HEAD" type="INTEGER"/>
    <column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
</dataformat>
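
Since this kind of setup usually fails on a column-count mismatch, a quick sanity check before training can help. The following is only a sketch, using Python's standard xml.etree.ElementTree; it assumes the data file is tab-separated with blank lines between sentences, and the function names `column_names` and `check_rows` are hypothetical helpers, not part of MaltParser.

```python
import xml.etree.ElementTree as ET


def column_names(dataformat_xml):
    """Return the column names declared in a dataformat XML string, in order."""
    root = ET.fromstring(dataformat_xml)
    return [col.get("name") for col in root.iter("column")]


def check_rows(dataformat_xml, conll_text):
    """Return 1-based line numbers whose field count differs from the column count."""
    expected = len(column_names(dataformat_xml))
    bad = []
    for lineno, line in enumerate(conll_text.splitlines(), start=1):
        # Blank lines are sentence separators (assumption), so skip them.
        if line.strip() and len(line.split("\t")) != expected:
            bad.append(lineno)
    return bad
```

An empty list from `check_rows` means every non-blank row has as many fields as the format file declares columns; it says nothing about whether the values themselves are valid.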

Then add the following to your MaltParser config file (there is an example config in maltparser/examples/optionexample.xml):

<?xml version="1.0" encoding="UTF-8"?>
<experiment>
    <optioncontainer>
...
        <optiongroup groupname="input">
            <option name="format" value="myconll"/>
        </optiongroup>
    </optioncontainer>
...
</experiment>

Then you should be able to provide CoreNLP CoNLL output as training data to MaltParser.
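
If you would rather not define a custom data format, another route is to rewrite the CoreNLP output into the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), which is the format MaltParser reads by default. This is a rough, untested-against-MaltParser sketch: it drops the NER column, reuses the POS tag for CPOSTAG (my assumption, since CoreNLP gives only one tag), and fills the remaining columns with `_`. The function name `corenlp_to_conllx` is made up for illustration.

```python
def corenlp_to_conllx(text):
    """Rewrite seven-column CoreNLP CoNLL text as ten-column CoNLL-X text."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            out.append("")  # keep the blank line that separates sentences
            continue
        idx, word, lemma, pos, _ner, head, deprel = line.split("\t")
        out.append("\t".join([
            idx, word, lemma,
            pos,     # CPOSTAG: assumption, same tag as POSTAG
            pos,     # POSTAG
            "_",     # FEATS: not provided by this CoreNLP output
            head, deprel,
            "_",     # PHEAD
            "_",     # PDEPREL
        ]))
    return "\n".join(out) + "\n"
```

The NER values are discarded here, which matches marking that column as IGNORE in the custom format above.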

Untested, but if the MaltParser docs are honest, this should work.

Example CoreNLP CoNLL output (I only used annotators tokenize,ssplit,pos):

$ echo "This is a test." | java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -outputFormat conll 2>/dev/null

1    This    this    DT     _    _    _
2    is      be      VBZ    _    _    _
3    a       a       DT     _    _    _
4    test    test    NN     _    _    _
5    .       .       .      _    _    _
