Create .conll file as output of Stanford Parser

I want to use the Stanford Parser to create a .conll file for further processing. So far I have managed to parse the test sentences with this command:

stanford-parser-full-2013-06-20/lexparser.sh  stanford-parser-full-2013-06-20/data/testsent.txt > output.txt

Instead of a .txt file I would like the output in CoNLL format. I'm pretty sure this is possible, as it is mentioned in the documentation (see here). Can I somehow modify my command, or will I have to write Java code?

Thanks for help!

Ubald asked 3/7, 2013 at 14:24

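If you're looking for dependencies printed out in CoNLL-X (CoNLL 2006) format, try the two commands below from the command line. The first writes Penn Treebank-style trees to testsent.tree; the second converts those trees to dependencies and prints them in CoNLL-X format: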

java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz stanford-parser-full-2013-06-20/data/testsent.txt >testsent.tree

java -mx150m -cp "stanford-parser-full-2013-06-20/*:" edu.stanford.nlp.trees.EnglishGrammaticalStructure -treeFile testsent.tree -conllx

Here's the output for the first test sentence:

1       Scores        _       NNS     NNS     _       4       nsubj        _       _
2       of            _       IN      IN      _       0       erased       _       _
3       properties    _       NNS     NNS     _       1       prep_of      _       _
4       are           _       VBP     VBP     _       0       root         _       _
5       under         _       IN      IN      _       0       erased       _       _
6       extreme       _       JJ      JJ      _       8       amod         _       _
7       fire          _       NN      NN      _       8       nn           _       _
8       threat        _       NN      NN      _       4       prep_under   _       _
9       as            _       IN      IN      _      13       mark         _       _
10      a             _       DT      DT      _      12       det          _       _
11      huge          _       JJ      JJ      _      12       amod         _       _
12      blaze         _       NN      NN      _      15       xsubj        _       _
13      continues     _       VBZ     VBZ     _       4       advcl        _       _
14      to            _       TO      TO      _      15       aux          _       _
15      advance       _       VB      VB      _      13       xcomp        _       _
16      through       _       IN      IN      _       0       erased       _       _
17      Sydney        _       NNP     NNP     _      20       poss         _       _
18      's            _       POS     POS     _       0       erased       _       _
19      north-western _       JJ      JJ      _      20       amod         _       _
20      suburbs       _       NNS     NNS     _      15       prep_through _       _
21      .             _       .       .       _       4       punct        _       _
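For the further processing mentioned in the question, the rows above can be read back in with a few lines of Java. This is only a minimal sketch that does not use the Stanford library: the class name ConllxReader and its methods are hypothetical, and it assumes the ten standard CoNLL-X columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) separated by tabs or spaces, with no whitespace inside a token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Stanford Parser distribution.
public class ConllxReader {

    /** The columns of one CoNLL-X row that the sample output actually fills in. */
    public static final class Token {
        public final int id;         // column 1: ID (1-based position in the sentence)
        public final String form;    // column 2: FORM (the word itself)
        public final String postag;  // column 5: POSTAG
        public final int head;       // column 7: HEAD (0 means no head)
        public final String deprel;  // column 8: DEPREL

        Token(int id, String form, String postag, int head, String deprel) {
            this.id = id;
            this.form = form;
            this.postag = postag;
            this.head = head;
            this.deprel = deprel;
        }
    }

    /** Parses one non-blank row. Assumes columns are separated by tabs or spaces. */
    public static Token parseLine(String line) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length != 10) {
            throw new IllegalArgumentException(
                "Expected 10 columns but found " + cols.length + ": " + line);
        }
        return new Token(Integer.parseInt(cols[0]), cols[1], cols[4],
                         Integer.parseInt(cols[6]), cols[7]);
    }

    /** Parses the rows of a single sentence, ignoring blank lines. */
    public static List<Token> parseSentence(List<String> lines) {
        List<Token> tokens = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                tokens.add(parseLine(line));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // First four rows of the sample output above.
        List<String> lines = Arrays.asList(
            "1       Scores        _       NNS     NNS     _       4       nsubj        _       _",
            "2       of            _       IN      IN      _       0       erased       _       _",
            "3       properties    _       NNS     NNS     _       1       prep_of      _       _",
            "4       are           _       VBP     VBP     _       0       root         _       _");

        List<Token> tokens = parseSentence(lines);
        for (Token t : tokens) {
            if (t.deprel.equals("erased")) {
                continue; // the sample marks these tokens as erased, so skip them
            }
            // IDs are 1-based, so the head token sits at index head - 1.
            String headForm = (t.head == 0) ? "ROOT" : tokens.get(t.head - 1).form;
            System.out.println(t.form + " --" + t.deprel + "--> " + headForm);
        }
    }
}
```

To read a real file instead of the inline sample, the lines could come from java.nio.file.Files.readAllLines, split into sentences at blank lines before calling parseSentence.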
Jeffreys answered 27/7, 2013 at 20:50

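I'm not sure you can do this through the command line, but this is a Java version:

// Assumes lp is a loaded LexicalizedParser and gsf is a GrammaticalStructureFactory.
// DocumentPreprocessor(String) reads the file at that path; wrapping the file name
// in a StringReader would tokenize the name itself instead of the file contents.
for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
        Tree parse = lp.apply(sentence);

        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        // The fourth argument (true) requests CoNLL-X output.
        GrammaticalStructure.printDependencies(gs, gs.typedDependencies(), parse, true, false);
}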
Risk answered 23/7, 2013 at 20:26

There is also a conll2007 output format; see the TreePrint documentation for all options.

Here is an example using version 3.8 of the Stanford parser. It assumes an input file with one sentence per line and produces Stanford Dependencies (not Universal Dependencies) with no propagation or collapsing, keeps punctuation, and writes the result in conll2007 format:

java -Xmx4g -cp "stanford-corenlp-full-2017-06-09/*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -sentences newline -outputFormat conll2007 -originalDependencies -outputFormatOptions "basicDependencies,includePunctuationDependencies" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz input.txt
Sumptuary answered 24/3, 2020 at 21:37
