Tokenizing using Pandas and spaCy
Asked Answered
I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of text per 'cell') and have been using pandas to organize/build the dataset. I'm hoping to use spaCy for all the NLP but can't quite figure out how to tokenize the text in my columns. I've read a bunch of the spaCy documentation and googled around, but all the examples I've found are for a single sentence or word, not 75K rows in a pandas DataFrame.

I've tried things like: df['new_col'] = [token for token in (df['col'])]

but would definitely appreciate some help/resources.

full (albeit messy) code available here

Declination answered 27/10, 2017 at 18:12 Comment(2)
What's the specific issue you are having? Are you getting an error? – Vibrator
@Vibrator I'm not getting an error, but the text doesn't seem to be tokenized (i.e. when I try to do further processing like lemmatization I get an error basically saying the text is still string format and not tokens). – Declination
I've never used spaCy (nltk has always gotten the job done for me) but from glancing at the documentation it looks like this should work:

import spacy
nlp = spacy.load('en_core_web_sm')

df['new_col'] = df['text'].apply(lambda x: nlp(x))

Note that nlp by default runs the entire spaCy pipeline, which includes part-of-speech tagging, parsing, and named entity recognition. You can significantly speed up your code by using nlp.tokenizer(x) instead of nlp(x), or by disabling parts of the pipeline when you load the model, e.g. nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner']).
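For a column of that size, spaCy's nlp.pipe can also help: it streams the texts in batches rather than making one call per row. A minimal sketch (the column names are illustrative, and spacy.blank("en") stands in for a full model so no download is needed):

```python
import pandas as pd
import spacy

# blank pipeline = tokenizer only; swap in a full model for tagging/parsing/NER
nlp = spacy.blank("en")

df = pd.DataFrame({"text": ["This is a sentence.", "Another one here."]})

# nlp.pipe streams the column's texts in batches instead of one nlp() call per row
df["new_col"] = list(nlp.pipe(df["text"]))
```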

Vibrator answered 27/10, 2017 at 19:15 Comment(1)
How could we change the result in the column into a list? – Imphal
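Re the comment above: a Doc is iterable over its Token objects, so one way (a sketch, using a blank English pipeline so no model download is needed) is a per-row list comprehension:

```python
import pandas as pd
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline
df = pd.DataFrame({"text": ["Hello world!"]})
df["new_col"] = df["text"].apply(nlp)

# each Doc iterates over its tokens; .text gives plain strings back
df["token_list"] = df["new_col"].apply(lambda doc: [tok.text for tok in doc])
```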
Make it faster using pandarallel

import spacy
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)
nlp = spacy.load("en_core_web_sm")

df['new_col'] = df['text'].parallel_apply(lambda x: nlp(x))
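If you'd rather avoid the extra dependency, spaCy itself can parallelize through nlp.pipe's n_process argument (available in spaCy v2.2.2+). A rough sketch, with a blank pipeline standing in for a full model:

```python
import pandas as pd
import spacy

nlp = spacy.blank("en")  # use spacy.load("en_core_web_sm") for a full model
df = pd.DataFrame({"text": ["First row here.", "Second row here."]})

# n_process spawns worker processes; batch_size sets how many texts go per batch
df["new_col"] = list(nlp.pipe(df["text"], n_process=2, batch_size=50))
```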
Imre answered 1/3, 2022 at 17:29 Comment(0)