Lemmatize a doc with spaCy?

I have a spaCy doc that I would like to lemmatize.

For example:

import spacy
nlp = spacy.load('en_core_web_lg')

my_str = 'Python is the greatest language in the world'
doc = nlp(my_str)

How can I convert every token in the doc to its lemma?

Fighter answered 2/8, 2018 at 16:16

Each token has a number of attributes; you can iterate through the doc to access them.

For example: [token.lemma_ for token in doc]

If you want to reconstruct the sentence you could use: ' '.join([token.lemma_ for token in doc])

For a full list of token attributes see: https://spacy.io/api/token#attributes
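
Putting the pieces together, here is a minimal runnable sketch (assuming the en_core_web_lg model from the question is installed):

import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp('Python is the greatest language in the world')

# Each token carries its lemma alongside other attributes such as part of speech.
for token in doc:
    print(token.text, token.lemma_, token.pos_)

# Rebuild the sentence from lemmas. Note that ' '.join normalizes
# whitespace instead of preserving the original spacing.
print(' '.join(token.lemma_ for token in doc))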

Jara answered 2/8, 2018 at 16:58
So, does the lemmatizer automatically run when nlp is called on a string? – Fighter
By default, yes. You can control which pipeline components are run, though I'm not sure where exactly lemmatization happens: spacy.io/usage/processing-pipelines#disabling – Overpass
To add to polm23's comment, this link in the docs shows the different components of the processing pipeline and where lemmatization happens. – Hiroshige
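
As the comments above discuss, which components run is configurable. A quick way to check what a loaded pipeline actually contains is nlp.pipe_names; the component names shown below are from a spaCy v3 model and will vary by version:

import spacy

nlp = spacy.load('en_core_web_lg')
# Lists the components that run when nlp is called on a string.
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']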

If you don't need a particular component of the pipeline (for example, the NER or the parser), you can disable loading it. This can sometimes make a big difference and improve loading speed.

For your case (lemmatizing a doc with spaCy), you only need the tagger component.

So here is some sample code:

import spacy

# keep only the tagger component, which is needed for lemmatization
nlp = spacy.load('en_core_web_lg', disable=["parser", "ner"])

my_str = 'Python is the greatest language in the world'

doc = nlp(my_str)
words_lemmas_list = [token.lemma_ for token in doc]
print(words_lemmas_list)

Output:

['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world']
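
Note that this tagger-only trick matches spaCy v2, where lemmas come from the tagger. In spaCy v3, lemmatization moved to a dedicated lemmatizer component (which relies on tagger and attribute_ruler), and spacy.load also accepts an exclude argument that skips loading a component entirely instead of merely disabling it. A hedged sketch for v3:

import spacy

# exclude= drops the components from the pipeline altogether,
# which saves more loading time than disable=.
nlp = spacy.load('en_core_web_lg', exclude=["parser", "ner"])
print(nlp.pipe_names)  # the lemmatizer and its dependencies remain

doc = nlp('Python is the greatest language in the world')
print([token.lemma_ for token in doc])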

Chemisorb answered 13/7, 2020 at 9:00
In case someone else reading this is wondering what the different components of the pipeline are (i.e. what can be enabled/disabled), this link in the docs shows them. – Hiroshige

This answer covers the case where your text consists of multiple sentences.

If you want a list with the lemma of every token, do:

import spacy
nlp = spacy.load('en_core_web_sm')  # the bare 'en' shortcut was removed in spaCy v3
my_str = 'Python is the greatest language in the world. A python is an animal.'
doc = nlp(my_str)

words_lemmata_list = [token.lemma_ for token in doc]
print(words_lemmata_list)
# Output: 
# ['Python', 'be', 'the', 'great', 'language', 'in', 'the', 'world', '.', 
# 'a', 'python', 'be', 'an', 'animal', '.']

If you want a list of all sentences with every token lemmatized, do:

sentences_lemmata_list = [sentence.lemma_ for sentence in doc.sents]
print(sentences_lemmata_list)
# Output:
# ['Python be the great language in the world .', 'a python be an animal .']
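
If you have many separate texts rather than one multi-sentence string, nlp.pipe streams them through the pipeline in batches, which is faster than calling nlp in a loop; a minimal sketch:

texts = ['Python is the greatest language in the world.', 'A python is an animal.']

# nlp.pipe yields one Doc per input text.
for doc in nlp.pipe(texts):
    print([token.lemma_ for token in doc])
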
Recriminate answered 6/1, 2021 at 14:45
