Preserve punctuation using unnest_tokens() in tidytext in R
I am using the tidytext package in R to do n-gram analysis.

Since I analyze tweets, I would like to preserve @ and # to capture mentions, retweets, and hashtags. However, the unnest_tokens function automatically removes all punctuation and converts the text to lowercase.

I found that unnest_tokens can tokenize with a regular expression via token = 'regex', so I can customize how it cleans the text. But that only works for unigram analysis; it doesn't help with n-grams, because I need to set token = 'ngrams' to do n-gram analysis.

Is there any way to prevent unnest_tokens from converting text to lowercase in n-gram analysis?
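
For reference, a minimal sketch of the behavior I mean, using a made-up one-row data frame in place of my tweet data:

library(tidytext)

# Toy tweet; the @user mention and #rstats hashtag are what I want to keep
tweets <- data.frame(text = "RT @user: #rstats is great!", stringsAsFactors = FALSE)

# Default bigram tokenization lowercases the text and strips the @ and # markers
unnest_tokens(tweets, bigram, text, token = "ngrams", n = 2)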

Ghirlandaio answered 12/6, 2017 at 23:23 Comment(3)
N.B. unnest_tokens makes use of the tokenizers package to do its heavy lifting, and in that project there is tokenize_tweets.R (Puggree)
Looking at the source: tokenize_ngrams <- function(x, lowercase = TRUE, n = 3L, n_min = n, stopwords = character(), ngram_delim = " ", simplify = FALSE). There is certainly an option not to lowercase in tokenize_ngrams. Worst case is to patch it. (Puggree)
Thanks for the comments. I think unnest_tokens uses tokenize_words to clean text: tokenize_words <- function(x, lowercase = TRUE, stopwords = NULL, strip_punct = TRUE, strip_numeric = FALSE, simplify = FALSE) {... I changed strip_punct = FALSE and ran it again, but it still doesn't work. (Ghirlandaio)

Arguments for tokenize_words are available within the unnest_tokens function call, so you can pass strip_punct = FALSE directly as an argument to unnest_tokens.

Example:

library(tidytext)

txt <- data.frame(
  text = "Arguments for `tokenize_words` are available within the `unnest_tokens` function call. So you can use `strip_punct = FALSE` directly as an argument for `unnest_tokens`.",
  stringsAsFactors = FALSE
)

unnest_tokens(txt, palabras, "text", strip_punct = FALSE)

 palabras
 1         arguments
 1.1             for
 1.2               `
 1.3  tokenize_words
 1.4               `
 1.5             are
 1.6       available
 1.7          within
 1.8             the
 1.9               `
 1.10  unnest_tokens
 1.11              `
 1.12       function
 1.13           call
 1.14              .
 1.15             so
 #And some more, but you get the point. 

Also available: lowercase = FALSE and strip_numeric = TRUE, to flip the respective defaults.
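
For instance, a sketch reusing the toy txt data frame from above; note that unnest_tokens also exposes lowercasing as its own to_lower argument (the same argument the answer below uses):

# Keep punctuation, drop numeric tokens, and preserve case;
# strip_punct and strip_numeric pass through to tokenize_words,
# while to_lower is unnest_tokens' own argument
unnest_tokens(txt, palabras, "text",
              strip_punct = FALSE, strip_numeric = TRUE, to_lower = FALSE)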

Jewess answered 3/8, 2018 at 20:11 Comment(0)

In tidytext version 0.1.9 you now have the option to tokenize tweets, and if you don't want lowercasing, use to_lower = FALSE:

unnest_tokens(tweet_df, word, tweet_column, token = "tweets", to_lower = FALSE)
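
For example, a sketch with a made-up one-row data frame standing in for tweet_df (tweet_column is just the placeholder column name from the call above):

library(tidytext)

tweet_df <- data.frame(
  tweet_column = "RT @user: #RStats is great!",
  stringsAsFactors = FALSE
)

# token = "tweets" keeps @mentions and #hashtags intact,
# and to_lower = FALSE preserves the original case
unnest_tokens(tweet_df, word, tweet_column, token = "tweets", to_lower = FALSE)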
Newby answered 3/6, 2018 at 7:58 Comment(0)
