How to solve Spanish lemmatization problems with SpaCy?
When trying to lemmatize a CSV with more than 60,000 Spanish words, spaCy does not write certain words correctly. I understand that the model is not 100% accurate; however, I have not found any other solution, since NLTK does not provide a Spanish lemmatizer.

A friend tried asking this question on the Spanish Stack Overflow, but that community is quite small compared with this one, and we got no answers.

code:

import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([word.lemma_ for word in doc])

df['column'] = df['column'].apply(lambda x: lemmatizer(x))

I lemmatized certain words that I had found to be wrong, to show that spaCy is not doing it correctly:

text = 'personas, ideas, cosas'
# translation: persons, ideas, things

print(lemmatizer(text))
# Current output:
#   personar , ideo , coser
#   (translation: personify, ideo, sew)

# The expected output should be:
#   persona, idea, cosa
#   (translation: person, idea, thing)
Peachy answered 4/3, 2020 at 21:30 Comment(11)
I'm not super familiar with SpaCy, but are you retraining it on your data or using it out of the box?Progression
@Progression I'm not retraining it, I'm using it directly on the df (the df is completely cleaned with regular expressions). That's why I tried it on a simple text, to see whether it was working incorrectly. If there's any other library to lemmatize in Spanish, let me know!Peachy
Maybe retraining the model is a good idea, because the example doesn't contain any complex words, but I don't know how to do it.Peachy
Once I tried to do lemmatization in Spanish, but the only useful thing I found was to go with stemming, using SnowballStemmer from NLTK.Len
@JuanJavierSantosOchoa, yes, I know that is my last option, but I understand lemmatization is more accurate than stemming.Peachy
I'm not a Spanish speaker, but for English lemmatization spaCy relies on knowing the part of speech of each word. It gets this info during the tagging step of nlp(text). However, your text doesn't look like real sentences, so it's probably getting the POS tags wrong a lot, which will lead to errors. BTW, spaCy is only about 85% correct for English lemmatization. You might want to look at Stanford's CoreNLP or CLiPS/pattern.en, although all of these solutions only reach low-90% accuracy, and all need to know the POS of the word.Acevedo
If you know the part-of-speech for each word (ie... if they're all nouns) you can skip the tagging step (nlp(text)) and call the lemmatizer directly with the POS type. This will speed up the process significantly and will likely improve accuracy as well.Acevedo
@Acevedo The test text is not a real sentence; it consists of words that were wrongly lemmatized in the original text. The dataframe, on the other hand, is full of sentences. I think Stanford's CoreNLP doesn't have a Spanish module.Peachy
The problem with knowing the POS is that the dataframe has 60k+ words. Even after applying the stopwords it doesn't work out, because I have both verbs and nouns.Peachy
As in the question, e.g., the word person is lemmatized to personify. Would you recommend I use stemming instead of lemmatization?Peachy
If you know the POS for each word, try calling the lemmatizer directly and passing in the POS. If you don't know the POS for each word, then stemming is probably your only option.Acevedo
Unlike the English lemmatizer, spaCy's Spanish lemmatizer does not use PoS information at all. It relies on a lookup list of inflected forms and lemmas (e.g., ideo → idear, ideas → idear, idea → idear, ideamos → idear, etc.) and simply outputs the first match in the list, regardless of its PoS.

I actually developed spaCy's new rule-based lemmatizer for Spanish, which takes PoS and morphological information (such as tense, gender, number) into account. These fine-grained rules make it a lot more accurate than the current lookup lemmatizer. It will be released soon!

Meanwhile, you can maybe use Stanford CoreNLP or FreeLing.
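The lookup behavior can be sketched in a few lines (the table entries below are hypothetical, chosen to mirror the wrong lemmas from the question, not spaCy's actual data): each inflected form maps to exactly one lemma, so an ambiguous form like "ideas" (noun "idea" vs. verb "idear") always gets whichever reading happens to be stored.

```python
# Toy sketch of a lookup-table lemmatizer. The entries are hypothetical;
# spaCy's real table is much larger but behaves the same way:
# one lemma per form, no PoS involved.
lookup = {
    "personas": "personar",  # stored verb reading shadows the noun "persona"
    "ideas": "idear",
    "cosas": "coser",
}

def lookup_lemmatize(word):
    # No PoS argument: the table alone decides, right or wrong.
    return lookup.get(word, word)

print([lookup_lemmatize(w) for w in ["personas", "ideas", "cosas"]])
# → ['personar', 'idear', 'coser']
```

A rule-based lemmatizer avoids this by first asking the tagger for the PoS and only then choosing among the candidate lemmas for that form.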

Guillory answered 5/3, 2020 at 19:54 Comment(14)
I'll be waiting for the project release. Meanwhile I will look up Stanford CoreNLP and FreeLing (in your experience, which one do you recommend?)Peachy
I think both are very accurate, but I haven't used them that much to have a preference. FreeLing is rule-based and Stanford is neural.Guillory
When you release the new rule-based lemmatizer, post it as an update to your answer. It will be really helpful.Peachy
Finally, I used StanfordNLP; it is pretty accurate and meets the requirements I was looking for.Peachy
Hi @GuadalupeRomero. Thanks for the hint! Are you going to release the new Spanish lemmatizer inside the spaCy project? How can I find out about that? Also, does the same happen with the Matcher and Spanish? I have tried a lot of different options, but it always returns a dark and discouraging voidBiflagellate
@JuanLuisChulilla Yes, there will be an official release of the new lemmatizer. I am not working on the matcher, but what do you mean exactly?Guillory
@GuadalupeRomero My mistake. I was wrong about Matcher. SorryBiflagellate
@Peachy Can you give any pointers about how to do this with StanfordNLP? The only information I can find is this Issue on github saying it is not possible.Purificator
@Purificator It seems StanfordNLP is outdated. Try Stanza instead, which is the latest, improved library created by the Stanford NLP groupPeachy
!pip install stanza

import stanza
stanza.download('es', package='ancora', processors='tokenize,mwt,pos,lemma', verbose=True)
stNLP = stanza.Pipeline(processors='tokenize,mwt,pos,lemma', lang='es', use_gpu=True)
doc = stNLP('Barack Obama nació en Hawaii.')
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')Peachy
Hi @GuadalupeRomero, do you have a launch date for your Spanish lemmatizer? Thank you : )Digressive
@RubialesAlberto it will be released with spacy v3Guillory
Hey @GuadalupeRomero, thank you for your work with Spacy, it will be very useful for all Spanish speakers analyzing texts. I am trying to use the rule base Spanish lemmatizer with nightly-spacy, like this: nlp_es_trf = spacy.load('es_dep_news_trf'); config = {"mode": "rule"}; nlp_es_trf.remove_pipe("lemmatizer"); nlp_es_trf.add_pipe("lemmatizer", config=config) But I found this error when trying to use it: ValueError: [E1004] Missing lemmatizer table(s) found for lemmatizer mode 'rule'. Required tables: ['lemma_rules']. Found: [] Maybe I am using it wrong?Clabo
@GuadalupeRomero Thanks for your contribution to the spaCy project. I guess since I have installed version 3.0.3 in my environment, I am using the lemmatizer update you mentioned in your answer, aren't I? It seems to do its work properly, at least for my use caseBergman
One option is to make your own lemmatizer.

This might sound frightening, but fear not! It is actually very simple to build one.

I've recently written a tutorial on how to build a lemmatizer; the link is here:

https://medium.com/analytics-vidhya/how-to-build-a-lemmatizer-7aeff7a1208c

As a summary, you'd have to:

  • Have a POS Tagger (you can use spaCy tagger) to tag input words.
  • Get a corpus of words and their lemmas - here, I suggest you download a Universal Dependencies Corpus for Spanish - just follow the steps in the tutorial mentioned above.
  • Create a lemma dict from the words extracted in the corpus.
  • Save the dict and make a wrapper function that receives both the word and its PoS.

In code, it'd look like this:

def lemmatize(word, pos):
    # lemma_dict maps word -> {pos: lemma}; fall back to the word itself
    if word in lemma_dict and pos in lemma_dict[word]:
        return lemma_dict[word][pos]
    return word

Simple, right?
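As a rough sketch of the whole recipe (the dict below is a tiny hypothetical sample, not a real Universal Dependencies extraction; the function is restated so the snippet runs on its own):

```python
# Tiny hypothetical lemma dict keyed by word, then by UD-style PoS tag.
# A real one would be extracted from a Universal Dependencies corpus.
lemma_dict = {
    "personas": {"NOUN": "persona"},
    "ideas": {"NOUN": "idea", "VERB": "idear"},
    "cosas": {"NOUN": "cosa"},
}

def lemmatize(word, pos):
    # Look up the (word, PoS) pair; fall back to the surface form.
    if word in lemma_dict and pos in lemma_dict[word]:
        return lemma_dict[word][pos]
    return word

print(lemmatize("ideas", "NOUN"))  # → idea
print(lemmatize("ideas", "VERB"))  # → idear
print(lemmatize("gatos", "NOUN"))  # → gatos (unknown word, returned as-is)
```

Note how the PoS disambiguates: "ideas" as a noun gives idea, as a verb gives idear, which is exactly what a plain lookup table cannot do.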

In fact, simple lemmatization doesn't require as much processing as one would think. The hard part lies in PoS tagging, but you get that for free. Either way, if you want to do the tagging yourself, you can see this other tutorial I made:

https://medium.com/analytics-vidhya/part-of-speech-tagging-what-when-why-and-how-9d250e634df6

Hope you get it solved.

Adder answered 5/3, 2020 at 18:28 Comment(0)
Maybe you can use FreeLing. Among many other functionalities, this library offers lemmatization in Spanish, Catalan, Basque, Italian and other languages.

In my experience, lemmatization in Spanish and Catalan is quite accurate, and although the library is written natively in C++, it has APIs for both Python and Java.

Aculeus answered 15/7, 2021 at 11:56 Comment(0)
You can use spacy-stanza. It wraps Stanza's models in spaCy's API:

import stanza
from spacy_stanza import StanzaLanguage

stanza.download("es")  # download the Spanish models once

text = "personas, ideas, cosas"

snlp = stanza.Pipeline(lang="es")
nlp = StanzaLanguage(snlp)
doc = nlp(text)
for token in doc:
    print(token.lemma_)
Daytime answered 15/12, 2020 at 8:56 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.