When lemmatizing a Spanish CSV with more than 60,000 words, spaCy produces incorrect lemmas for certain words. I understand the model is not 100% accurate; however, I have not found any other solution, since NLTK does not provide a Spanish lemmatizer.
A friend tried asking this question on the Spanish-language Stack Overflow, but that community is quite small compared with this one, and we got no answers.
Code:
import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatizer(text):
    doc = nlp(text)
    return ' '.join(word.lemma_ for word in doc)

df['column'] = df['column'].apply(lemmatizer)
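Side note: for 60,000+ rows, calling nlp() once per row through .apply is slow. A sketch of batched processing with nlp.pipe (a standard spaCy API; the column name 'column' is just the placeholder from the example above):

import pandas as pd
import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatize_all(texts):
    # nlp.pipe streams the texts through the pipeline in batches,
    # which is much faster than one nlp() call per row
    return [' '.join(tok.lemma_ for tok in doc) for doc in nlp.pipe(texts)]

df['column'] = lemmatize_all(df['column'].astype(str))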
To show that spaCy is not lemmatizing correctly, I tried it on some of the words that came out wrong:
text = 'personas, ideas, cosas'
# translation: persons, ideas, things
print(lemmatizer(text))
# Current output:
# personar , ideo , coser
# translation: personify, ideo, sew
# Expected output:
# persona, idea, cosa
# translation: person, idea, thing
Comments:

You could try the SnowballStemmer from NLTK. – Len

spaCy picks lemmas based on the POS tags it assigns when you run nlp(text); however, it doesn't look like your text is real sentences, so it's probably getting the POS tags wrong a lot. This will lead to errors. BTW, spaCy is only about 85% correct for English lemmatization. You might want to look at Stanford's CoreNLP or CLiPS/pattern.en, although all of these solutions only reach the low 90s in accuracy, and all need to know the POS of the word. – Acevedo
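One way to check that diagnosis is to print the POS tag spaCy assigns to each token (token.pos_ and token.lemma_ are standard token attributes):

doc = nlp('personas, ideas, cosas')
for token in doc:
    # If the tagger labels these nouns as VERB, the lemmas will be wrong too
    print(token.text, token.pos_, token.lemma_)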
You could also skip the full pipeline (nlp(text)) and call the lemmatizer directly with the POS type. This will speed up the process significantly and will likely improve accuracy as well. – Acevedo
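A sketch of that suggestion, assuming spaCy 2.x, where the model's lemmatizer is reachable as nlp.vocab.morphology.lemmatizer and can be called with a word and a coarse POS tag (in spaCy 3 the lemmatizer is a pipeline component instead, so the access path differs):

import spacy

nlp = spacy.load('es_core_news_sm')
lemmatizer = nlp.vocab.morphology.lemmatizer  # spaCy 2.x access path

# Supplying the POS explicitly skips the tagger;
# the call returns a list of candidate lemmas
print(lemmatizer('personas', 'NOUN'))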
person is lemmatized to personify. Would you recommend using stemming instead of lemmatization? – Peachy
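For reference, NLTK's SnowballStemmer does support Spanish. Note that stemming truncates words rather than producing dictionary forms, so 'personas' yields a stem like 'person', not the lemma 'persona':

from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('spanish')
# Stemming chops suffixes; the results are stems, not dictionary lemmas
print([stemmer.stem(w) for w in ['personas', 'ideas', 'cosas']])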