How to do text pre-processing using spaCy?

How do I do preprocessing steps such as stopword removal, punctuation removal, stemming and lemmatization in spaCy, using Python?

I have text data (paragraphs and sentences) in a CSV file and want to do text cleaning.

Kindly give an example that loads the CSV into a pandas DataFrame.

Zinnia answered 10/8, 2017 at 6:24 Comment(1)
It is pretty simple and straightforward in spaCy. First, let us know what you have tried. – Syrinx
14

This may help:

import spacy                       # load spaCy
from nltk.corpus import stopwords  # NLTK's stop word list, used below
import pandas as pd

# only the lemmatizer is needed here, so disable the other pipeline components
nlp = spacy.load("en", disable=['parser', 'tagger', 'ner'])
stops = stopwords.words("english")

def normalize(comment, lowercase, remove_stopwords):
    if lowercase:
        comment = comment.lower()
    comment = nlp(comment)
    lemmatized = list()
    for word in comment:
        lemma = word.lemma_.strip()
        if lemma:
            if not remove_stopwords or lemma not in stops:
                lemmatized.append(lemma)
    return " ".join(lemmatized)


Data = pd.read_csv("My_file.csv")  # see the comments below
Data['Text_After_Clean'] = Data['Text'].apply(normalize, lowercase=True, remove_stopwords=True)
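
If you would rather not depend on NLTK for the stop word list, spaCy ships its own (as one of the comments below notes, nlp.Defaults.stop_words is a set, not a list). A minimal sketch of the same normalize idea using only spaCy; the function name normalize_spacy_only is mine, not part of the original answer:

import spacy

nlp = spacy.load("en", disable=['parser', 'tagger', 'ner'])
spacy_stops = nlp.Defaults.stop_words  # spaCy's built-in stop words (a set, not a list)

def normalize_spacy_only(comment, lowercase=True, remove_stopwords=True):
    if lowercase:
        comment = comment.lower()
    lemmas = (tok.lemma_.strip() for tok in nlp(comment))
    return " ".join(l for l in lemmas
                    if l and (not remove_stopwords or l not in spacy_stops))

Data['Text_After_Clean'] = Data['Text'].apply(normalize_spacy_only)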
Zinnia answered 13/3, 2018 at 5:17 Comment(4)
Where is the Data object you are using imported from? – Twentytwo
This Data object looks like a pandas DataFrame to me. – Codex
@NathanMcCoy it's a pandas DataFrame: Data = pd.read_csv("My_file.csv") – Zinnia
You can also get stop words from the nlp object with stops = nlp.Defaults.stop_words. Note that it's a set, not a list. – Mohr
10

The best pipeline I have encountered so far is the one in Maksym Balatsko's Medium article "Text preprocessing steps and universal reusable pipeline". The best part is that it can be used as a transformer in a scikit-learn pipeline and supports multiprocessing.

I have modified what Maksym did, kept the packages to a minimum, and used generators instead of lists to avoid loading all the data into memory:

import numpy as np
import pandas as pd
import multiprocessing as mp

import string
import spacy 
from sklearn.base import TransformerMixin, BaseEstimator

nlp = spacy.load("en_core_web_sm")

class TextPreprocessor(BaseEstimator, TransformerMixin):
    def __init__(self,
                 nlp = nlp,
                 n_jobs=1):
        """
        Text preprocessing transformer includes steps:
            1. Punctuation removal
            2. Stop words removal
            3. Lemmatization

        nlp  - spacy model
        n_jobs - parallel jobs to run
        """
        self.nlp = nlp
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        return self

    def transform(self, X, *_):
        X_copy = X.copy()

        partitions = 1
        cores = mp.cpu_count()
        if self.n_jobs <= -1:
            partitions = cores
        elif self.n_jobs <= 0:
            return X_copy.apply(self._preprocess_text)
        else:
            partitions = min(self.n_jobs, cores)

        data_split = np.array_split(X_copy, partitions)
        pool = mp.Pool(cores)
        data = pd.concat(pool.map(self._preprocess_part, data_split))
        pool.close()
        pool.join()

        return data

    def _preprocess_part(self, part):
        return part.apply(self._preprocess_text)

    def _preprocess_text(self, text):
        doc = self.nlp(text)
        removed_punct = self._remove_punct(doc)
        removed_stop_words = self._remove_stop_words(removed_punct)
        return self._lemmatize(removed_stop_words)

    def _remove_punct(self, doc):
        return (t for t in doc if t.text not in string.punctuation)

    def _remove_stop_words(self, doc):
        return (t for t in doc if not t.is_stop)

    def _lemmatize(self, doc):
        return ' '.join(t.lemma_ for t in doc)

You can use it as:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import  LogisticRegressionCV
from sklearn.pipeline import Pipeline

# ... assuming data split X_train, X_test ...

clf  = Pipeline(steps=[
        ('normalize', TextPreprocessor(n_jobs=-1)), 
        ('features', TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
        ('classifier', LogisticRegressionCV(cv=5,solver='saga',scoring='accuracy', n_jobs=-1, verbose=1))
    ])

clf.fit(X_train, y_train)
clf.predict(X_test)

X_train is the raw text data; it passes through TextPreprocessor, then TfidfVectorizer extracts features, and the result goes to the classifier.
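
For completeness, here is a hedged sketch of how X_train, X_test, y_train, y_test could come from the asker's CSV. The column names "Text" and "Label" are assumptions, not something given in the question; TextPreprocessor.transform calls Series.apply, so passing the raw text column as a pandas Series is exactly what it expects:

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("My_file.csv")           # CSV from the question
X_train, X_test, y_train, y_test = train_test_split(
    df["Text"], df["Label"],              # "Text" and "Label" are assumed column names
    test_size=0.2, random_state=42)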

Impropriate answered 2/5, 2020 at 8:38 Comment(0)
6

It can easily be done with a few commands. Also note that spaCy doesn't support stemming; you can refer to this thread for the reasoning.

import spacy
nlp = spacy.load('en')  # the 'en' shortcut is spaCy 2.x; newer versions use spacy.load('en_core_web_sm')

# sample text
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry. \
Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown \
printer took a galley of type and scrambled it to make a type specimen book. It has survived not \
only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. \
It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, \
and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.\
There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration \
in some form, by injected humour, or randomised words which don't look even slightly believable. If you are \
going to use a passage of Lorem Ipsum, you need to be sure there isn't anything embarrassing hidden in the \
middle of text. All the Lorem Ipsum generators on the Internet tend to repeat predefined chunks as necessary, \
making this the first true generator on the Internet. It uses a dictionary of over 200 Latin words, combined \
with a handful of model sentence structures, to generate Lorem Ipsum which looks reasonable. The generated \
Lorem Ipsum is therefore always free from repetition, injected humour, or non-characteristic words etc."""

# convert the text to a spaCy Doc; all spaCy documents are tokenized,
# and individual tokens can be accessed with document[i]
document = nlp(text)
document[0:10]  # = Lorem Ipsum is simply dummy text of the printing and

# the good thing about spaCy is that a lot of processing (lemmatization etc.)
# happens when nlp(text) creates the document. Sentences are available via document.sents
list(document.sents)[0]

# lemmatized words can be accessed using document[i].lemma_ and you can check 
# if a word is a stopword by checking the `.is_stop` attribute of the word.
# here I am extracting the lemmatized form of each word provided they are not a stop word
lemmas = [token.lemma_ for token in document if not token.is_stop]
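
The list comprehension above only drops stop words; the question also asked for punctuation removal, which the token attribute is_punct covers. A small extension of the same line:

# keep lemmas of tokens that are neither stop words nor punctuation
lemmas = [token.lemma_ for token in document
          if not token.is_stop and not token.is_punct]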
Nisse answered 10/8, 2017 at 7:12 Comment(0)
5

I developed a package for this exact situation. Check out spacy-cleaner:

import spacy
import spacy_cleaner
from spacy_cleaner.processing import removers, mutators


model = spacy.load("en_core_web_sm")

pipeline = spacy_cleaner.Pipeline(
    model,
    removers.remove_stopword_token,
    removers.remove_punctuation_token,
    mutators.mutate_lemma_token,
)

texts = ["Hello, my name is Cellan!"]

pipeline.clean(texts)

# ['hello Cellan']
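
To tie this back to the CSV/pandas setup from the question: assuming pipeline.clean accepts any sequence of strings and returns one cleaned string per input (as the example above suggests), the cleaned texts can be written back to a DataFrame column:

import pandas as pd

Data = pd.read_csv("My_file.csv")  # CSV from the question
Data["Text_After_Clean"] = pipeline.clean(Data["Text"].tolist())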

Check out our docs for more information. Hope it helps! :)

Charry answered 4/11, 2022 at 19:32 Comment(0)
-4

Please read their docs; here is one example:

https://nicschrading.com/project/Intro-to-NLP-with-spaCy/

Meshwork answered 10/8, 2017 at 6:44 Comment(0)
