How to convert text file to CoNLL format for malt parser?
I'm trying to use MaltParser with the pre-made English model. However, I don't know how to convert a text corpus of English sentences into the CoNLL format that MaltParser requires as input. I could not find any documentation on the site. How should I go about this?

Update: I am following the post "Create .conll file as output of Stanford Parser" to create a .conll file. However, that approach uses Stanford Parser.

Yamada answered 16/11, 2014 at 22:20 Comment(2)
Is this a raw text corpus or a treebank? You need a corpus with dependency annotations (either a dependency treebank or a constituency treebank, which can be converted into a dependency treebank). - Yoheaveho
This is just a raw text corpus. The post I linked above has a procedure for retrieving the constituency treebank via Stanford Parser, but it produces a .tree file. I believe MaltParser only takes .conll files? - Yamada

There is a CoNLL formatting option for CoreNLP output, but unfortunately it doesn't match what MaltParser expects. (Confusingly, there are several different common CoNLL data formats, one for each of the different competition years.)

If you run CoreNLP from the command line with the option -outputFormat conll, you'll get output in the following TSV format (example output at end of answer):

INDEX    WORD    LEMMA    POS    NER    DEPHEAD    DEPREL

MaltParser expects a slightly different format, but you can customize the data input/output format. Try putting this content in maltparser/appdata/dataformat/myconll.xml:

<?xml version="1.0" encoding="UTF-8"?>
<dataformat name="myconll" reader="tab" writer="tab">
    <column name="ID" category="INPUT" type="INTEGER"/>
    <column name="FORM" category="INPUT" type="STRING"/>
    <column name="LEMMA" category="INPUT" type="STRING"/>
    <column name="POSTAG" category="INPUT" type="STRING"/>
    <column name="NER" category="IGNORE" type="STRING"/>
    <column name="HEAD" category="HEAD" type="INTEGER"/>
    <column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
</dataformat>
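
Since a mismatch between the declared columns and the actual data is an easy mistake to make, here is a minimal Python sketch that reads the column names from the data format definition above and flags rows whose field count differs. The helper names `column_names` and `check_rows` are hypothetical, and the sketch assumes the data is tab-separated with blank lines between sentences.

```python
import xml.etree.ElementTree as ET

# The data format definition from above, embedded so the sketch is self-contained.
DATAFORMAT_XML = """<?xml version="1.0" encoding="UTF-8"?>
<dataformat name="myconll" reader="tab" writer="tab">
    <column name="ID" category="INPUT" type="INTEGER"/>
    <column name="FORM" category="INPUT" type="STRING"/>
    <column name="LEMMA" category="INPUT" type="STRING"/>
    <column name="POSTAG" category="INPUT" type="STRING"/>
    <column name="NER" category="IGNORE" type="STRING"/>
    <column name="HEAD" category="HEAD" type="INTEGER"/>
    <column name="DEPREL" category="DEPENDENCY_EDGE_LABEL" type="STRING"/>
</dataformat>
""".encode("utf-8")


def column_names(xml_bytes):
    """Return the column names declared in a data format file, in order."""
    root = ET.fromstring(xml_bytes)
    return [col.get("name") for col in root.findall("column")]


def check_rows(rows, names):
    """Return 1-based numbers of non-blank rows whose field count differs from the format."""
    bad = []
    for n, row in enumerate(rows, start=1):
        # Blank rows are sentence separators and are skipped.
        if row.strip() and len(row.rstrip("\n").split("\t")) != len(names):
            bad.append(n)
    return bad


names = column_names(DATAFORMAT_XML)
rows = ["1\tThis\tthis\tDT\t_\t_\t_", "", "1\tbad\trow"]
bad_rows = check_rows(rows, names)
print(names)
print(bad_rows)
```

Any row number reported by `check_rows` points at a line that would not line up with the seven declared columns.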

Then add the following to your MaltParser config file (find an example config in maltparser/examples/optionexample.xml):

<?xml version="1.0" encoding="UTF-8"?>
<experiment>
    <optioncontainer>
...
        <optiongroup groupname="input">
            <option name="format" value="myconll"/>
        </optiongroup>
    </optioncontainer>
...
</experiment>

Then you should be able to provide CoreNLP CoNLL output as training data to MaltParser.

Untested, but if the MaltParser docs are accurate, this should work.


Example CoreNLP CoNLL output (I only used annotators tokenize,ssplit,pos):

$ echo "This is a test." | java edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos -outputFormat conll 2>/dev/null

1   This    this    DT  _   _   _
2   is  be  VBZ _   _   _
3   a   a   DT  _   _   _
4   test    test    NN  _   _   _
5   .   .   .   _   _   _
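
As an alternative to the custom data format, the seven CoreNLP columns could be remapped into the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The following is a minimal, untested-against-MaltParser sketch: the function name `corenlp_to_conllx` is hypothetical, it assumes tab-separated input, and it simply copies the POS tag into both CPOSTAG and POSTAG and drops the NER column.

```python
def corenlp_to_conllx(lines):
    """Map CoreNLP 7-column rows to 10-column CoNLL-X rows.

    Assumes each non-blank line has exactly 7 tab-separated fields:
    INDEX WORD LEMMA POS NER DEPHEAD DEPREL.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            out.append("")  # a blank line separates sentences
            continue
        idx, word, lemma, pos, _ner, head, deprel = line.split("\t")
        # CoNLL-X order: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
        # The POS tag fills both CPOSTAG and POSTAG (an assumption); NER is dropped.
        out.append("\t".join([idx, word, lemma, pos, pos, "_", head, deprel, "_", "_"]))
    return out


sample = [
    "1\tThis\tthis\tDT\t_\t_\t_",
    "2\tis\tbe\tVBZ\t_\t_\t_",
    "3\ta\ta\tDT\t_\t_\t_",
    "4\ttest\ttest\tNN\t_\t_\t_",
    "5\t.\t.\t.\t_\t_\t_",
    "",
]
converted = corenlp_to_conllx(sample)
print("\n".join(converted))
```

With this approach no custom data format file is needed, at the cost of an extra conversion step on every CoreNLP output file.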
Yoheaveho answered 17/11, 2014 at 5:50 Comment(0)
