How to solve Spanish lemmatization problems with SpaCy?
When trying to lemmatize a CSV with more than 60,000 Spanish words, spaCy does not write certain words correctly. I understand that the model is not 100% accurate; however, I have not found any other solution, since NLTK does not provide a Spanish lemmatizer.

A friend tried asking this question on the Spanish Stack Overflow, but that community is quite small compared with this one, and we got no answers.

code:

import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([word.lemma_ for word in doc])

df['column'] = df['column'].apply(lambda x: lemmatizer(x))

I lemmatized certain words that I had found to be wrong, to show that spaCy is not doing it correctly:

text = 'personas, ideas, cosas'
# translation: persons, ideas, things

print(lemmatizer(text))
# Current output:
#   personar , ideo , coser
#   (translation: personify, ideo, sew)

# The expected output should be:
#   persona, idea, cosa
#   (translation: person, idea, thing)
Peachy answered 4/3, 2020 at 21:30 Comment(11)
I'm not super familiar with SpaCy, but are you retraining it on your data or using it out of the box?Progression
@Progression I'm not retraining it, I'm using it directly on the df (the df is completely cleaned with regular expressions). That's why I tried it on a simple text, to see whether it was working incorrectly. If there's any other library to lemmatize in Spanish, let me know!Peachy
Maybe retraining the model is a good idea, because the example doesn't contain any complex words, but I don't know how to do it.Peachy
Once I tried to do lemmatization in Spanish, but the only useful thing I found was to go with stemming, using SnowballStemmer from NLTK.Len
@JuanJavierSantosOchoa, yes, I know that is my last option, but I understand lemmatization is more accurate than stemming.Peachy
I'm not a Spanish speaker, but for English lemmatization spaCy relies on knowing the part of speech of each word. It gets this info during the tagging step of nlp(text). However, your text doesn't look like real sentences, so it's probably getting the POS tags wrong a lot, which will lead to errors. BTW, spaCy is only about 85% correct for English lemmatization. You might want to look at Stanford's CoreNLP or CLiPS/pattern.en, although all of these solutions only reach low-90% accuracy, and all need to know the POS of the word.Acevedo
If you know the part-of-speech for each word (ie... if they're all nouns) you can skip the tagging step (nlp(text)) and call the lemmatizer directly with the POS type. This will speed up the process significantly and will likely improve accuracy as well.Acevedo
@Acevedo The test text is not a real sentence; it consists of words that were wrongly lemmatized in the original text. The dataframe, on the other hand, is full of sentences. I think Stanford's CoreNLP doesn't have a Spanish module.Peachy
The problem with knowing the POS is that the dataframe has 60k+ words. Even after applying the stopwords it doesn't work out, because I have both verbs and nouns.Peachy
As in the question, e.g., the word person is lemmatized to personify. Would you recommend I use stemming instead of lemmatization?Peachy
If you know the POS for each word, try calling the lemmatizer directly and passing in the POS. If you don't know the POS for each word, then stemming is probably your only option.Acevedo
Unlike the English lemmatizer, spaCy's Spanish lemmatizer does not use PoS information at all. It relies on a lookup list of inflected forms and lemmas (e.g., ideo → idear, ideas → idear, idea → idear, ideamos → idear, etc.) and simply outputs the first match in the list, regardless of its PoS.

I actually developed spaCy's new rule-based lemmatizer for Spanish, which takes PoS and morphological information (such as tense, gender, number) into account. These fine-grained rules make it a lot more accurate than the current lookup lemmatizer. It will be released soon!

Meanwhile, you can maybe use Stanford CoreNLP or FreeLing.
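The lookup behavior can be sketched in a few lines (the table entries below are hypothetical, chosen to mirror the wrong lemmas from the question, not spaCy's actual data): each inflected form maps to exactly one lemma, so an ambiguous form like "ideas" (noun "idea" vs. verb "idear") always gets whichever reading happens to be stored.

```python
# Toy sketch of a lookup-table lemmatizer. The entries are hypothetical;
# spaCy's real table is much larger but behaves the same way:
# one lemma per form, no PoS involved.
lookup = {
    "personas": "personar",  # stored verb reading shadows the noun "persona"
    "ideas": "idear",
    "cosas": "coser",
}

def lookup_lemmatize(word):
    # No PoS argument: the table alone decides, right or wrong.
    return lookup.get(word, word)

print([lookup_lemmatize(w) for w in ["personas", "ideas", "cosas"]])
# → ['personar', 'idear', 'coser']
```

A rule-based lemmatizer avoids this by first asking the tagger for the PoS and only then choosing among the candidate lemmas for that form.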

Guillory answered 5/3, 2020 at 19:54 Comment(14)
I'll be waiting for the project release. Meanwhile I will look up Stanford CoreNLP and FreeLing (in your experience, which one do you recommend?)Peachy
I think both are very accurate, but I haven't used them that much to have a preference. FreeLing is rule-based and Stanford is neural.Guillory
When you release the new rule-based lemmatizer, post it as an update to your answer. It will be really helpful.Peachy
Finally, I used StanfordNLP; it is pretty accurate and meets the requirements I was looking for.Peachy
Hi @GuadalupeRomero. Thanks for the hint! Are you going to release the new Spanish lemmatizer inside the spaCy project? How can I find out about that? Also, does the same happen with the Matcher and Spanish? I have tried a lot of different options, but it always returns a dark and discouraging voidBiflagellate
@JuanLuisChulilla Yes, there will be an official release of the new lemmatizer. I am not working on the matcher, but what do you mean exactly?Guillory
@GuadalupeRomero My mistake. I was wrong about Matcher. SorryBiflagellate
@Peachy Can you give any pointers about how to do this with StanfordNLP? The only information I can find is this Issue on github saying it is not possible.Purificator
@Purificator It seems StanfordNLP is outdated. Try Stanza instead, which is the latest, improved library created by the Stanford NLP groupPeachy
!pip install stanza

import stanza
stanza.download('es', package='ancora', processors='tokenize,mwt,pos,lemma', verbose=True)
stNLP = stanza.Pipeline(processors='tokenize,mwt,pos,lemma', lang='es', use_gpu=True)
doc = stNLP('Barack Obama nació en Hawaii.')
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')Peachy
Hi @GuadalupeRomero, do you have a launch date for your Spanish lemmatizer? Thank you : )Digressive
@RubialesAlberto it will be released with spacy v3Guillory
Hey @GuadalupeRomero, thank you for your work with Spacy, it will be very useful for all Spanish speakers analyzing texts. I am trying to use the rule base Spanish lemmatizer with nightly-spacy, like this: nlp_es_trf = spacy.load('es_dep_news_trf'); config = {"mode": "rule"}; nlp_es_trf.remove_pipe("lemmatizer"); nlp_es_trf.add_pipe("lemmatizer", config=config) But I found this error when trying to use it: ValueError: [E1004] Missing lemmatizer table(s) found for lemmatizer mode 'rule'. Required tables: ['lemma_rules']. Found: [] Maybe I am using it wrong?Clabo
@GuadalupeRomero Thanks for your contribution to the spaCy project. I guess since I have installed version 3.0.3 in my environment, I am using the lemmatizer update you mentioned in your answer, aren't I? It seems to do its work properly, at least for my use caseBergman
One option is to make your own lemmatizer.

This might sound frightening, but fear not! It is actually very simple to build one.

I've recently written a tutorial on how to build a lemmatizer; the link is here:

https://medium.com/analytics-vidhya/how-to-build-a-lemmatizer-7aeff7a1208c

As a summary, you'd have to:

  • Have a POS Tagger (you can use spaCy tagger) to tag input words.
  • Get a corpus of words and their lemmas - here, I suggest you download a Universal Dependencies Corpus for Spanish - just follow the steps in the tutorial mentioned above.
  • Create a lemma dict from the words extracted in the corpus.
  • Save the dict and make a wrapper function that receives both the word and its PoS.

In code, it'd look like this:

def lemmatize(word, pos):
    # lemma_dict maps word -> {pos: lemma}; fall back to the word itself
    if word in lemma_dict and pos in lemma_dict[word]:
        return lemma_dict[word][pos]
    return word

Simple, right?
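As a rough sketch of the whole recipe (the dict below is a tiny hypothetical sample, not a real Universal Dependencies extraction; the function is restated so the snippet runs on its own):

```python
# Tiny hypothetical lemma dict keyed by word, then by UD-style PoS tag.
# A real one would be extracted from a Universal Dependencies corpus.
lemma_dict = {
    "personas": {"NOUN": "persona"},
    "ideas": {"NOUN": "idea", "VERB": "idear"},
    "cosas": {"NOUN": "cosa"},
}

def lemmatize(word, pos):
    # Look up the (word, PoS) pair; fall back to the surface form.
    if word in lemma_dict and pos in lemma_dict[word]:
        return lemma_dict[word][pos]
    return word

print(lemmatize("ideas", "NOUN"))  # → idea
print(lemmatize("ideas", "VERB"))  # → idear
print(lemmatize("gatos", "NOUN"))  # → gatos (unknown word, returned as-is)
```

Note how the PoS disambiguates: "ideas" as a noun gives idea, as a verb gives idear, which is exactly what a plain lookup table cannot do.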

In fact, simple lemmatization doesn't require as much processing as one would think. The hard part lies in PoS tagging, but you get that for free. Either way, if you want to do the tagging yourself, you can see this other tutorial I made:

https://medium.com/analytics-vidhya/part-of-speech-tagging-what-when-why-and-how-9d250e634df6

Hope you get it solved.

Adder answered 5/3, 2020 at 18:28 Comment(0)
Maybe you can use FreeLing. Among many other functionalities, this library offers lemmatization in Spanish, Catalan, Basque, Italian and other languages.

In my experience, lemmatization in Spanish and Catalan is quite accurate, and although the library is written natively in C++, it has APIs for both Python and Java.

Aculeus answered 15/7, 2021 at 11:56 Comment(0)
You can use spacy-stanza. It wraps Stanza's models in spaCy's API:

import stanza
from spacy_stanza import StanzaLanguage

stanza.download("es")  # download the Spanish models once

text = "personas, ideas, cosas"

snlp = stanza.Pipeline(lang="es")
nlp = StanzaLanguage(snlp)
doc = nlp(text)
for token in doc:
    print(token.lemma_)
Daytime answered 15/12, 2020 at 8:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.