Tokenizing using Pandas and spaCy
Asked Answered
I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of text per 'cell') and have been using pandas to organize/build the dataset. I'm hoping to use spaCy for all the NLP but can't quite figure out how to tokenize the text in my columns. I've read a bunch of the spaCy documentation and googled around, but all the examples I've found are for a single sentence or word, not 75K rows in a pandas DataFrame.

I've tried things like: df['new_col'] = [token for token in (df['col'])]

but would definitely appreciate some help/resources.

full (albeit messy) code available here

Declination answered 27/10, 2017 at 18:12 Comment(2)
What's the specific issue you are having? Are you getting an error? – Vibrator
@Vibrator I'm not getting an error, but the text doesn't seem to be tokenized (i.e. when I try to do further processing like lemmatization I get an error basically saying the text is still string format and not tokens). – Declination
I've never used spaCy (nltk has always gotten the job done for me) but from glancing at the documentation it looks like this should work:

import spacy
nlp = spacy.load('en_core_web_sm')

df['new_col'] = df['text'].apply(lambda x: nlp(x))

Note that nlp by default runs the entire spaCy pipeline, which includes part-of-speech tagging, parsing, and named entity recognition. You can significantly speed up your code by using nlp.tokenizer(x) instead of nlp(x), or by disabling parts of the pipeline when you load the model, e.g. nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner']).
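For a column of that size, spaCy's nlp.pipe can also help: it streams the texts in batches rather than making one call per row. A minimal sketch (the column names are illustrative, and spacy.blank("en") stands in for a full model so no download is needed):

```python
import pandas as pd
import spacy

# blank pipeline = tokenizer only; swap in a full model for tagging/parsing/NER
nlp = spacy.blank("en")

df = pd.DataFrame({"text": ["This is a sentence.", "Another one here."]})

# nlp.pipe streams the column's texts in batches instead of one nlp() call per row
df["new_col"] = list(nlp.pipe(df["text"]))
```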

Vibrator answered 27/10, 2017 at 19:15 Comment(1)
How could we change the result in the column into a list? – Imphal
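Re the comment above: a Doc is iterable over its Token objects, so one way (a sketch, using a blank English pipeline so no model download is needed) is a per-row list comprehension:

```python
import pandas as pd
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline
df = pd.DataFrame({"text": ["Hello world!"]})
df["new_col"] = df["text"].apply(nlp)

# each Doc iterates over its tokens; .text gives plain strings back
df["token_list"] = df["new_col"].apply(lambda doc: [tok.text for tok in doc])
```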
Make it faster using pandarallel

import spacy
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)
nlp = spacy.load("en_core_web_sm")

df['new_col'] = df['text'].parallel_apply(lambda x: nlp(x))
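If you'd rather avoid the extra dependency, spaCy itself can parallelize through nlp.pipe's n_process argument (available in spaCy v2.2.2+). A rough sketch, with a blank pipeline standing in for a full model:

```python
import pandas as pd
import spacy

nlp = spacy.blank("en")  # use spacy.load("en_core_web_sm") for a full model
df = pd.DataFrame({"text": ["First row here.", "Second row here."]})

# n_process spawns worker processes; batch_size sets how many texts go per batch
df["new_col"] = list(nlp.pipe(df["text"], n_process=2, batch_size=50))
```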
Imre answered 1/3, 2022 at 17:29 Comment(0)