Spanish POS tagging with Stanford NLP - is it possible to get the person/number/gender?

Asked 10/4, 2015 at 7:53 Answered 3/8, 2015 at 4:21

I'm using Stanford NLP to do POS tagging for Spanish texts. I can get a POS tag for each word, but I notice that I am only given the first four sections of the Ancora tag; the last three sections, for person, number and gender, are missing.

Why does Stanford NLP only use a reduced version of the Ancora tag?
Is it possible to get the entire tag using Stanford NLP?

Here is my code (please excuse the jruby...):

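# Assumes the Stanford CoreNLP jars and Spanish models are on the classpath.
require "java"
java_import "edu.stanford.nlp.pipeline.StanfordCoreNLP"
java_import "edu.stanford.nlp.pipeline.Annotation"

props = java.util.Properties.new()
props.put("tokenize.language", "es")
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
props.put("ner.model", "edu/stanford/nlp/models/ner/spanish.ancora.distsim.s512.crf.ser.gz")
props.put("pos.model", "/stanford-postagger-full-2015-01-30/models/spanish-distsim.tagger")
props.put("parse.model", "edu/stanford/nlp/models/lexparser/spanishPCFG.ser.gz")

pipeline = StanfordCoreNLP.new(props)
annotation = Annotation.new("No sé qué estoy haciendo. Me pregunto si esto va a funcionar.")
# Run the configured annotators over the text.
pipeline.annotate(annotation)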

I am getting this as the output:

[Text=No CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=rn Lemma=no NamedEntityTag=O]
[Text=sé CharacterOffsetBegin=3 CharacterOffsetEnd=5 PartOfSpeech=vmip000 Lemma=sé NamedEntityTag=O]
[Text=qué CharacterOffsetBegin=6 CharacterOffsetEnd=9 PartOfSpeech=pt000000 Lemma=qué NamedEntityTag=O]
[Text=estoy CharacterOffsetBegin=10 CharacterOffsetEnd=15 PartOfSpeech=vmip000 Lemma=estoy NamedEntityTag=O]
[Text=haciendo CharacterOffsetBegin=16 CharacterOffsetEnd=24 PartOfSpeech=vmg0000 Lemma=haciendo NamedEntityTag=O]
[Text=. CharacterOffsetBegin=24 CharacterOffsetEnd=25 PartOfSpeech=fp Lemma=. NamedEntityTag=O]

(I noticed that the lemmas are also incorrect and first thought that was an issue for a separate question. Never mind: I see that Stanford NLP does not support Spanish lemmatization.)

Gentian answered 10/4, 2015 at 7:53 Comment(0)

Why does Stanford NLP only use a reduced version of the Ancora tag?

This was a practical decision made to ensure high tagging accuracy. (Retaining morphological information on the tags caused the entire tagger to suffer from data sparsity, so it did worse not only on morphological annotation but across the board.)
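To make concrete what is dropped, here is a minimal sketch that decodes a seven-character verb tag into named fields. The field order and value tables follow the EAGLES-style layout that Ancora verb tags use as far as I understand it; `decode_verb_tag` and the tables are hypothetical helpers written for illustration, not part of Stanford NLP.

```ruby
# Field order for a 7-character verb tag such as "vmip1s0" (assumed layout).
VERB_FIELDS = %i[category type mood tense person number gender].freeze

# Illustrative value tables; "0" is treated as "not specified".
VERB_VALUES = {
  category: { "v" => "verb" },
  type:     { "m" => "main", "a" => "auxiliary", "s" => "semiauxiliary" },
  mood:     { "i" => "indicative", "s" => "subjunctive", "m" => "imperative",
              "n" => "infinitive", "g" => "gerund", "p" => "participle" },
  tense:    { "p" => "present", "i" => "imperfect", "f" => "future",
              "s" => "past", "c" => "conditional" },
  person:   { "1" => "first", "2" => "second", "3" => "third" },
  number:   { "s" => "singular", "p" => "plural" },
  gender:   { "m" => "masculine", "f" => "feminine" }
}.freeze

# Hypothetical helper: pair each character of the tag with its field name.
# Codes missing from the tables (including "0") decode to nil.
def decode_verb_tag(tag)
  tag = tag.downcase
  unless tag.length == 7 && tag.start_with?("v")
    raise ArgumentError, "expected a 7-character verb tag"
  end

  VERB_FIELDS.zip(tag.chars).map { |field, code| [field, VERB_VALUES[field][code]] }.to_h
end

full    = decode_verb_tag("vmip1s0") # a full tag with person and number filled in
reduced = decode_verb_tag("vmip000") # the reduced tag from the output above
puts full.inspect
puts reduced.inspect
```

With the reduced tag, the last three fields all decode to nil, which is exactly the information the question is asking about.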

Is it possible to get the entire tag using Stanford NLP?

No. You could get quite far doing this with a simple rule-based system, though, or use the Stanford Classifier to train your own morphological annotator. (Feel free to share your code if you pick either path!)
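As a rough idea of what the rule-based route could look like, the sketch below fills in person and number for present indicative verb tags only, using a small irregular-form lookup plus suffix rules. Everything here (`add_person_number`, the tables, the restriction to tags shaped like `vmip000`) is an assumption made for illustration; a real annotator would need far more rules and a proper lexicon.

```ruby
# Irregular forms need an explicit lookup; these two come from the example sentence.
IRREGULAR_PRESENT = {
  "sé"    => %w[1 s],
  "estoy" => %w[1 s]
}.freeze

# Suffix rules for regular present indicative forms. Longer suffixes come
# first so that "-amos" is tried before "-as" or "-o".
PRESENT_SUFFIXES = [
  [/(amos|emos|imos)\z/, %w[1 p]],
  [/(áis|éis|ís)\z/,     %w[2 p]],
  [/(an|en)\z/,          %w[3 p]],
  [/(as|es)\z/,          %w[2 s]],
  [/o\z/,                %w[1 s]],
  [/(a|e)\z/,            %w[3 s]]
].freeze

# Hypothetical helper: return the tag with person and number filled in,
# or the tag unchanged when no rule applies.
def add_person_number(word, tag)
  # Only handle reduced present indicative verb tags such as "vmip000".
  return tag unless tag =~ /\Av.ip000\z/

  form = word.downcase
  rule = PRESENT_SUFFIXES.find { |pattern, _| form =~ pattern }
  person, number = IRREGULAR_PRESENT[form] || (rule && rule.last)
  return tag if person.nil?

  # Finite verbs carry no gender, so the last position stays "0".
  tag[0, 4] + person + number + "0"
end

puts add_person_number("estoy", "vmip000")
puts add_person_number("pregunto", "vmip000")
puts add_person_number("No", "rn")
```

The suffix rules are deliberately naive: they will mislabel irregular forms that are not in the lookup, so the table of exceptions is where most of the work would go.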

Quintuplet answered 10/4, 2015 at 13:56 Comment(2)

Thanks for the answer. I'm not sure if you know anything about Freeling but does this mean that the Freeling POS tagger would suffer from the issues you mentioned? – Gentian 13/4, 2015 at 0:38

You're right, I don't know much about Freeling internals. They may do some sort of rule-based annotation at the tail end of their tagging process, or perhaps they just bite the bullet and handle the whole slew of possible tags. – Quintuplet 13/4, 2015 at 15:47

If you are not restricted to the Stanford POS tagger, you might want to try RDRPOSTagger, a toolkit for POS and morphological tagging. RDRPOSTagger provides pre-trained POS and morphological tagging models for 40 different languages, including Spanish.

For Spanish POS and morphological tagging, RDRPOSTagger was trained on the IULA Spanish LSP Treebank. It obtained a tagging accuracy of 97.95% at a tagging speed of 200K words/second in the Java implementation (10K words/second in the Python implementation), on a computer running 64-bit Windows 7 with a Core i5 2.50GHz CPU and 6GB of memory.

Gynecocracy answered 3/8, 2015 at 4:21 Comment(0)
