Use spaCy to find lemmas for Russian (and other languages that don't have a model)

I have downloaded the spaCy English model and am finding lemmas using this code:

import spacy
nlp = spacy.load('en')
doc = nlp(u'Two apples')
for token in doc:
    print(token, token.lemma, token.lemma_)

Output:

Two 11711838292424000352 two
apples 8566208034543834098 apple

Now I want to do the same thing for Russian, but spaCy doesn't have a model for Russian. However, I have seen their GitHub code for the Russian language, and I think that code could be used to find lemmas.

I am new to spaCy and need a starting point for languages that don't have models. I have also noticed that for some languages, Urdu for example, they provide a lookup dictionary for lemmatization.
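To make the lookup-dictionary idea concrete, here is a minimal pure-Python sketch of how lookup lemmatization works in principle. The table `LEMMA_LOOKUP` and the helpers `lookup_lemma` and `lemmatize` are hypothetical names with hand-written entries, not spaCy's actual data or API; a real table would be loaded from a language resource.

```python
# Hypothetical, hand-written lookup table: inflected form -> lemma.
# A real table would be loaded from a language resource file.
LEMMA_LOOKUP = {
    "apples": "apple",
    "мира": "мир",
    "россии": "россия",
}


def lookup_lemma(word, table=LEMMA_LOOKUP):
    """Return the lemma for word, or the lowercased word if it is not in the table."""
    key = word.lower()
    return table.get(key, key)


def lemmatize(text, table=LEMMA_LOOKUP):
    """Lemmatize whitespace-separated words; punctuation handling is left out."""
    return [lookup_lemma(word, table) for word in text.split()]


print(lemmatize("Two apples"))  # ['two', 'apple']
```

A lookup table needs no statistical model, which is why it can serve languages without one, but it cannot resolve forms that are missing from the table or ambiguous between lemmas.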

I want to extend this to all languages that don't have models.

Note: I believe the code above could be improved further. In my case I need only the lemma, so which components can I turn off, and how?
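As a rough sketch of one way to switch components off: `spacy.load` accepts a `disable` argument listing pipeline components to skip. The sketch assumes the loaded model has components named `parser` and `ner`, as the standard English models do; the helper names `load_for_lemmas` and `lemmas` are hypothetical. The tagger is left enabled because lemmatization in those pipelines relies on part-of-speech tags.

```python
# Components that lemmatization does not depend on in the standard English
# pipelines (an assumption about the loaded model's component names).
UNNEEDED_COMPONENTS = ["parser", "ner"]


def load_for_lemmas(model_name):
    """Load a spaCy model with the parser and NER switched off."""
    import spacy  # imported lazily; requires spaCy and the named model

    return spacy.load(model_name, disable=UNNEEDED_COMPONENTS)


def lemmas(nlp, text):
    """Return the lemma string of every token in text."""
    return [token.lemma_ for token in nlp(text)]
```

With the model from the question installed, this would be used as `nlp = load_for_lemmas('en')` followed by `lemmas(nlp, u'Two apples')`.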

Epigene answered 4/2, 2019 at 8:15 Comment(5)
You can use the multi-language spaCy package and check the accuracy. - Venable
@RahulAgarwal I have checked; it is not helpful. - Epigene
Would it make sense to translate the Russian to English with Google Translate, find the lemma, and then translate back? - Venable
That is not a feasible solution in my case. - Epigene
Hmm. They have provided code on GitHub, but I am struggling with how to use it. If you look at the GitHub code above and can share a simple code example, that would be helpful. - Epigene

spaCy recently launched a handy wrapper over StanfordNLP, so you can use StanfordNLP goodies seamlessly within spaCy pipelines:

https://github.com/explosion/spacy-stanfordnlp

The code would look something like this (not tested):

import stanfordnlp
from spacy_stanfordnlp import StanfordNLPLanguage

stanfordnlp.download("ru")

snlp = stanfordnlp.Pipeline(lang="ru")
nlp = StanfordNLPLanguage(snlp)

doc = nlp("Привет мир, это Россия")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)
Cerelia answered 6/2, 2019 at 11:35 Comment(3)
Stanford CoreNLP supports lemmatization for English only. Check this link: stanfordnlp.github.io/CoreNLP/human-languages.html - Epigene
The benchmarking results indicate that the Russian language models do support lemmatization: stanfordnlp.github.io/stanfordnlp/performance.html - Cerelia
Thanks for sharing. I also found this link later on. - Epigene

You can use spaCy with the Russian model ru2 from this project. It works.

Badajoz answered 26/3, 2019 at 12:18 Comment(0)

You can now use Stanza (previously StanfordNLP) for this:

import stanza
from spacy_stanza import StanzaLanguage

stanza.download('ru')  # will take a while

snlp = stanza.Pipeline(lang="ru")
nlp = StanzaLanguage(snlp)

text = "Андре́й Серге́евич Арша́вин (род. 29 мая 1981[4], Ленинград) — российский футболист, бывший капитан сборной России, заслуженный мастер спорта России (2008)."
doc = nlp(text)

for token in doc:
    print(token, token.lemma, token.lemma_)

All available models are listed here: https://stanfordnlp.github.io/stanza/models.html

Gusto answered 18/3, 2020 at 13:57 Comment(1)
I get ImportError: The Russian lemmatizer requires the pymorphy2 library. Is Stanza just a wrapper for the pymorphy2 library? - Henryhenryetta

You can use the Russian lemmatizer from spaCy. Following this tutorial, the result will be:

from spacy.lang.ru import Russian
nlp = Russian()

def lemmatization(text):
    doc = nlp(text)
    for token in doc:
        print(token, token.lemma, token.lemma_)
    tokens = [token.lemma_ for token in doc]
    return " ".join(tokens)

text = "Андре́й Серге́евич Арша́вин (род. 29 мая 1981[4], Ленинград) — российский футболист, бывший капитан сборной России, заслуженный мастер спорта России (2008)."
lemmatization(text)

Output:

'Андре́й серге́ Арша́вин ( род . 29 мая 1981[4 ] , Ленинград ) — российский футболист , бывший капитан сборной России , заслуженный мастер спорт России ( 2008 ) .'

It can also be useful to apply stemming after lemmatization:

import nltk
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='russian')
tokenizer = nltk.tokenize.WhitespaceTokenizer()

def stemming(text):
    tokens = [stemmer.stem(w) for w in tokenizer.tokenize(text)]
    return " ".join(tokens)

stemming(text)

Output:

'андре́ серге́евич арша́вин (род. 29 ма 1981[4], ленинград) — российск футболист, бывш капита сборн россии, заслужен мастер спорт росс (2008).'
Battledore answered 27/8, 2020 at 4:31 Comment(1)
The stemming code worked, but for me the lemmatization code above does not return the same text as shown in the output. - Rb
