Running the Stanford CoreNLP server with French models

I am trying to analyse some French text with the Stanford CoreNLP tool (it's my first time using any Stanford NLP software).

To do so, I have downloaded the v3.6.0 jar and the corresponding French models.

Then I run the server with:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer

As described in this answer, I call the API with:

wget --post-data 'Bonjour le monde.' 'localhost:9000/?properties={"parse.model":"edu/stanford/nlp/models/parser/nndep/UD_French.gz", "annotators": "parse", "outputFormat": "json"}' -O -

but I get the following log and error:

 [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
 [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.TokenizerAnnotator - TokenizerAnnotator: No tokenizer type provided. Defaulting to PTBTokenizer.
 [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
 [pool-1-thread-1] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator parse
 [pool-1-thread-1] INFO edu.stanford.nlp.parser.common.ParserGrammar - Loading parser from serialized file edu/stanford/nlp/models/parser/nndep/UD_French.gz ...

 edu.stanford.nlp.io.RuntimeIOException: java.io.StreamCorruptedException: invalid stream header: 64696374
    at edu.stanford.nlp.parser.common.ParserGrammar.loadModel(ParserGrammar.java:188)
    at edu.stanford.nlp.pipeline.ParserAnnotator.loadModel(ParserAnnotator.java:212)
    at edu.stanford.nlp.pipeline.ParserAnnotator.<init>(ParserAnnotator.java:115)
    ...

The solutions proposed here suggest that the code and model versions differ, but I have downloaded them from the same page (and they both have the same version number in their names), so I am pretty sure they match.

Any other hints about what I am doing wrong?

(I should also mention that I am not a Java expert, so maybe I missed a stupid step...)

Communicant answered 15/6, 2016 at 14:31

OK, after a lot of reading and unsuccessful tries, I found a way to make it work (for v3.6.0). Here are the details, in case they are of interest to someone else:

  1. Download the code and French models from http://stanfordnlp.github.io/CoreNLP/index.html#download. Unzip the code .zip and copy the French model .jar into that directory (there is no need to remove the English models; they have different file names anyway).

  2. cd to that directory and then run the server with:

    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
    

(it's a pity that the -props flag doesn't help here)

  3. Call the API, repeating the properties listed in StanfordCoreNLP-french.properties:

    wget --header="Content-Type: text/plain; charset=UTF-8" \
         --post-data 'Bonjour le monde.' \
         'localhost:9000/?properties={"annotators":"tokenize,ssplit,pos,parse","parse.model":"edu/stanford/nlp/models/lexparser/frenchFactored.ser.gz","pos.model":"edu/stanford/nlp/models/pos-tagger/french/french.tagger","tokenize.language":"fr","outputFormat":"json"}' \
         -O -

    which finally gives a 200 response using the French models!

(NB: I don't know how to make this work with the web UI, and likewise for UTF-8 support there.)
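
Side note on the original error: "invalid stream header: 64696374" decodes to the ASCII bytes of "dict", i.e. the file is plain text rather than the serialized Java object that ParserGrammar expects. That is because edu/stanford/nlp/models/parser/nndep/UD_French.gz is a model for the depparse annotator, not the parse annotator. Here is an untested sketch of a call that should load that model with the right annotator, against the same server setup as above:

    wget --header="Content-Type: text/plain; charset=UTF-8" \
         --post-data 'Bonjour le monde.' \
         'localhost:9000/?properties={"annotators":"tokenize,ssplit,pos,depparse","depparse.model":"edu/stanford/nlp/models/parser/nndep/UD_French.gz","pos.model":"edu/stanford/nlp/models/pos-tagger/french/french.tagger","tokenize.language":"fr","outputFormat":"json"}' \
         -O -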

Communicant answered 16/6, 2016 at 10:14
Can we run CoreNLP using our trained model? – Catchpenny
Yes you can, with the parse.model option, and similarly for the other annotators; see the sketch below. – Communicant
Jesus Christ, thanks a lot! That would've taken me hours to figure out. – Deane
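
A minimal sketch of the custom-model usage mentioned in the comments; the path /path/to/my-parser.ser.gz is a hypothetical placeholder for a parser model you trained yourself:

    # hypothetical: point parse.model at your own serialized parser model
    wget --post-data 'Some text to parse.' \
         'localhost:9000/?properties={"annotators":"tokenize,ssplit,pos,parse","parse.model":"/path/to/my-parser.ser.gz","outputFormat":"json"}' \
         -O -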

As a potentially useful addition for some, this is what the complete properties file for German looks like:

# annotators
annotators = tokenize, ssplit, mwt, pos, ner, depparse

# tokenize
tokenize.language = de
tokenize.postProcessor = edu.stanford.nlp.international.german.process.GermanTokenizerPostProcessor

# mwt
mwt.mappingFile = edu/stanford/nlp/models/mwt/german/german-mwt.tsv

# pos
pos.model = edu/stanford/nlp/models/pos-tagger/german-ud.tagger

# ner
ner.model = edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz
ner.applyNumericClassifiers = false
ner.applyFineGrained = false
ner.useSUTime = false

# parse
parse.model = edu/stanford/nlp/models/srparser/germanSR.beam.ser.gz

# depparse
depparse.model = edu/stanford/nlp/models/parser/nndep/UD_German.gz

The complete properties files for Arabic, Chinese, French, German, and Spanish can all be found in the CoreNLP GitHub repository.
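
To actually run the server with such a file, newer CoreNLP releases accept a -serverProperties flag. A sketch, assuming the German models jar is on the classpath and the file above is saved as StanfordCoreNLP-german.properties:

    # start the server preloading the German pipeline defaults
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
         -serverProperties StanfordCoreNLP-german.properties \
         -port 9000 -timeout 15000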

Feodor answered 5/5, 2020 at 18:44
