Passing a pandas dataframe column to an NLTK tokenizer - McMap

Passing a pandas dataframe column to an NLTK tokenizer

Asked 21/1, 2018 at 3:40 Answered 21/1, 2018 at 3:45

Solved python string pandas nltk tokenize

I have a pandas DataFrame raw_df with two columns, ID and sentences. I need to convert each sentence to a string. The code below runs without errors and reports the dtype of the sentences column as "object".

raw_df['sentences'] = raw_df.sentences.astype(str)
raw_df.sentences.dtypes

Out: dtype('O')

Then, I try to tokenize sentences and get a TypeError that the method is expecting a string or bytes-like object. What am I doing wrong?

raw_sentences=tokenizer.tokenize(raw_df)

Same TypeError for

raw_sentences = nltk.word_tokenize(raw_df)

Goatsucker answered 21/1, 2018 at 3:40 Comment(2)

What package is tokenizer.tokenize from? – Igneous 21/1, 2018 at 3:44

what does the data look like? – Amiraamis 21/1, 2018 at 3:45

I'm assuming this is an NLTK tokenizer. I believe these work by taking sentences as input and returning tokenised words as output.
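As a minimal sketch of that input/output contract: the question does not say which tokenizer is in use, so NLTK's WordPunctTokenizer is assumed here as a stand-in, mainly because it needs no extra data downloads.

```python
from nltk.tokenize import WordPunctTokenizer

# Assumption: a stand-in for the question's `tokenizer` object
tokenizer = WordPunctTokenizer()

# One string in, one list of token strings out
tokens = tokenizer.tokenize("Hello, world!")
print(tokens)  # ['Hello', ',', 'world', '!']
```

The tokenizer works on one string at a time, which is why handing it a whole DataFrame fails.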

What you're passing is raw_df, which is a pd.DataFrame object, not a str. The tokenizer will not be applied to each row unless you ask for that explicitly. The apply method on the sentences column does exactly that:

raw_df['tokenized_sentences'] = raw_df['sentences'].apply(tokenizer.tokenize)

Assuming this works without any hitches, tokenized_sentences will be a column of lists.
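For a self-contained illustration, the sketch below rebuilds the situation with made-up data. The sample sentences and the choice of WordPunctTokenizer are assumptions, not taken from the question. nltk.word_tokenize should slot into apply the same way, though it needs NLTK's tokenizer models downloaded first.

```python
import pandas as pd
from nltk.tokenize import WordPunctTokenizer

# Hypothetical sample data standing in for the question's raw_df
raw_df = pd.DataFrame({
    "ID": [1, 2],
    "sentences": ["The cat sat.", "Dogs bark loudly!"],
})

# Assumption: stand-in for the question's `tokenizer`
tokenizer = WordPunctTokenizer()

# Passing the whole DataFrame reproduces the TypeError from the question
raised_type_error = False
try:
    tokenizer.tokenize(raw_df)
except TypeError:
    raised_type_error = True

# Applying the tokenizer to each value of the column works
raw_df["sentences"] = raw_df["sentences"].astype(str)
raw_df["tokenized_sentences"] = raw_df["sentences"].apply(tokenizer.tokenize)

# Each cell of the new column now holds a list of tokens
print(raw_df["tokenized_sentences"].tolist())
```

Note that astype(str) still yields dtype object, since pandas stores Python strings in object columns by default. The dtype shown in the question was therefore not the cause of the error.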

Since you're performing text processing on DataFrames, I'd recommend taking a look at another answer of mine here: Applying NLTK-based text pre-proccessing on a pandas dataframe

Deaconess answered 21/1, 2018 at 3:45 Comment(0)