How to apply the NLTK word_tokenize function to a Pandas DataFrame of Twitter data?

This is the code I am using for semantic analysis of Twitter data:

import pandas as pd
import datetime
import numpy as np
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

# load the raw tweets (the CSV has no header row)
df = pd.read_csv('twitDB.csv', header=None,
                 sep=',', error_bad_lines=False, encoding='utf-8')

# concatenate the first four columns into a single 'tweet' string
hula = df[[0, 1, 2, 3]]
hula = hula.fillna(0)
hula['tweet'] = (hula[0].astype(str) + hula[1].astype(str)
                 + hula[2].astype(str) + hula[3].astype(str))
hula["tweet"] = hula.tweet.str.lower()

ho = hula["tweet"]
ho = ho.replace(r'\s+', ' ', regex=True)  # collapse runs of whitespace
ho = ho.replace(r'\.+', '.', regex=True)  # collapse runs of dots
special_char_list = [':', ';', '?', '}', ')', '{', '(']
for special_char in special_char_list:
    ho = ho.replace(special_char, '')
print(ho)
print(ho)

ho = ho.replace(r'((www\.[\s]+)|(https?://[^\s]+))', 'URL', regex=True)  # mask URLs
ho = ho.replace(r'#([^\s]+)', r'\1', regex=True)  # drop the '#' from hashtags
ho = ho.replace('\'"', '', regex=True)

lem = WordNetLemmatizer()
stem = PorterStemmer()

eng_stopwords = stopwords.words('english')
ho = ho.to_frame(name=None)
# dump the whole frame, index and header included, into one big string
a = ho.to_string(buf=None, columns=None, col_space=None, header=True,
                 index=True, na_rep='NaN', formatters=None, float_format=None,
                 sparsify=False, index_names=True, justify=None, line_width=None,
                 max_rows=None, max_cols=None, show_dimensions=False)
fg = stem.stem(a)  # stem that entire string in one go
wordList = word_tokenize(fg)
wordList = [word for word in wordList if word not in eng_stopwords]
print(wordList)

Input (i.e. a):

                                              tweet
0     1495596971.6034188::automotive auto ebc greens...
1     1495596972.330948::new free stock photo of cit...

I am getting the output (wordList) in this format:

tweet
 0
1495596971.6034188
:
:automotive
auto

I want each row's output kept as its own row, not flattened into a single stream of tokens. How can I do that? If you have better code for semantic analysis of Twitter data, please share it with me.

Sibella answered 25/5, 2017 at 6:21 Comment(0)

In short:

df['Text'].apply(word_tokenize)

Or if you want to add another column to store the tokenized list of strings:

df['tokenized_text'] = df['Text'].apply(word_tokenize) 
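
The original code goes wrong at the to_string() step: it flattens the whole frame, index and header included, into one big string before tokenizing, which is why the index numbers and the tweet header show up as tokens. Applying the tokenizer per row keeps each tweet's tokens together. A minimal runnable sketch, assuming a column named Text and toy data in place of the real twitDB.csv:

import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # word_tokenize needs the punkt models

# toy data standing in for the real twitDB.csv
df = pd.DataFrame({'Text': ['automotive auto ebc greenstuff',
                            'new free stock photo of city']})

df['tokenized_text'] = df['Text'].apply(word_tokenize)
print(df['tokenized_text'])
# 0    [automotive, auto, ebc, greenstuff]
# 1    [new, free, stock, photo, of, city]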

There are tokenizers written specifically for Twitter text; see http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.casual

To use nltk.tokenize.TweetTokenizer:

from nltk.tokenize import TweetTokenizer
tt = TweetTokenizer()
df['Text'].apply(tt.tokenize)
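
For example, the tweet tokenizer keeps hashtags and emoticons together as single tokens, where word_tokenize would split them apart (the sample string below is just an illustration):

from nltk.tokenize import TweetTokenizer

tt = TweetTokenizer()
print(tt.tokenize("This is a cooool #dummysmiley: :-) :-P <3"))
# ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3']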

Bradstreet answered 25/5, 2017 at 7:18 Comment(4)
I'm glad the answer helped.Bradstreet
Your questions are going to get closed easily if you don't strip the irrelevant parts of your code and only post information crucial to your question. Make edits to the new question you ask ;PBradstreet
Sure, will do that and ask again. Thanks :)Sibella
@alvas, do you know why I am getting TypeError: expected string or bytes-like object when running your code above on my pandas dataframe column with text? My only difference is I am using sent_tokenize to split into sentences as opposed to words.Dividivi
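
The TypeError: expected string or bytes-like object mentioned in the last comment usually means the column contains non-string values, e.g. NaN from empty cells (which pandas stores as floats). A minimal sketch of one common fix, with a made-up column name text:

import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # sent_tokenize needs the punkt models

df = pd.DataFrame({'text': ['First sentence. Second one.', None]})

# fill missing values with an empty string so every cell is a str
df['sentences'] = df['text'].fillna('').apply(sent_tokenize)
print(df['sentences'])
# 0    [First sentence., Second one.]
# 1                                []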
