How is nltk.TweetTokenizer different from nltk.word_tokenize?
I am unable to understand the difference between the two. I have learned that word_tokenize uses the Penn Treebank conventions for tokenization, but I can find nothing on TweetTokenizer. For which sort of data should I use TweetTokenizer instead of word_tokenize?

Saphena answered 20/5, 2020 at 17:53 Comment(0)
Both tokenizers split a given sentence into words, but they are tuned for different inputs. TweetTokenizer is tailored to tweet-like text: it keeps hashtags, @-mentions, and emoticons intact as single tokens, while word_tokenize breaks them apart.

The example below should make the difference clear:

from nltk.tokenize import TweetTokenizer, word_tokenize
tt = TweetTokenizer()
tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <-- @remy: This is waaaaayyyy too much for you!!!!!!"
print(tt.tokenize(tweet))
print(word_tokenize(tweet))

# output
# ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--', '@remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']
# ['This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--', '@', 'remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!', '!', '!', '!']

You can see that word_tokenize split #dummysmiley into '#' and 'dummysmiley', while TweetTokenizer kept it whole as '#dummysmiley'. TweetTokenizer is built mainly for analyzing tweets. You can learn more about the available tokenizers in the NLTK tokenize documentation.
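TweetTokenizer also takes a few constructor options that have no equivalent in word_tokenize. A minimal sketch (preserve_case, reduce_len, and strip_handles are parameters of NLTK's TweetTokenizer):

```python
from nltk.tokenize import TweetTokenizer

# strip_handles removes @-mentions, reduce_len shortens runs of 3+ repeated
# characters to exactly three, and preserve_case=False lower-cases word tokens
# (emoticons are left untouched)
tt = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
print(tt.tokenize("@remy This is waaaaayyyy too cooool!!! :-)"))
```

With these flags, '@remy' disappears from the output, 'waaaaayyyy' is shortened to 'waaayyy', and 'This' is lower-cased to 'this'.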

Dido answered 22/5, 2020 at 3:1 Comment(1)
In addition to this answer, another great tutorial on TweetTokenizer can also be found here; it focuses on the problems of tokenizing social media data. – Incalescent
It also seems to deal differently with abbreviated negations ("isn't" for example):

from nltk.tokenize import TweetTokenizer, word_tokenize

tweet_tokenizer = TweetTokenizer()

text = ("The quick brown fox isn't jumping over the lazy dog, "
        "co-founder multi-word expression. #yes!")

standard_nltk = word_tokenize(text)
print(standard_nltk)
# output: ['The', 'quick', 'brown', 'fox', 'is', "n't", 'jumping', 'over',
# 'the', 'lazy', 'dog', ',', 'co-founder', 'multi-word', 'expression', '.',
# '#', 'yes', '!']

twitter_nltk = tweet_tokenizer.tokenize(text)
print(twitter_nltk)
# output: ['The', 'quick', 'brown', 'fox', "isn't", 'jumping', 'over',
# 'the', 'lazy', 'dog', ',', 'co-founder', 'multi-word', 'expression', '.',
# '#yes', '!']
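For comparison, NLTK also ships wordpunct_tokenize, a purely regex-based splitter (it matches \w+|[^\w\s]+), which is even more aggressive with the same contraction. A quick sketch:

```python
from nltk.tokenize import wordpunct_tokenize

# wordpunct_tokenize splits on every boundary between word characters and
# punctuation, so the apostrophe in "isn't" becomes its own token
print(wordpunct_tokenize("The fox isn't jumping. #yes!"))
# output: ['The', 'fox', 'isn', "'", 't', 'jumping', '.', '#', 'yes', '!']
```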
Marv answered 10/9, 2023 at 12:34 Comment(0)
