I am using the `tidytext` package in R to do n-gram analysis.
Since I analyze tweets, I would like to preserve @ and # to capture mentions, retweets, and hashtags. However, the `unnest_tokens` function automatically removes all punctuation and converts the text to lowercase.
I found that `unnest_tokens` has an option to tokenize with a regular expression via `token='regex'`, so I can customize the way it cleans the text. However, that only works for unigram analysis; it doesn't work for n-grams, because I need to set `token='ngrams'` to do n-gram analysis.
Is there any way to prevent `unnest_tokens` from converting text to lowercase in n-gram analysis?
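To make the desired behavior concrete, here is a minimal base R sketch of n-grams that keep case, @, and #. It is not how `tidytext` works internally. The helper name `ngrams_keep_case` is hypothetical, and the sketch assumes that splitting on whitespace is an acceptable tokenization for tweets.

```r
# Hypothetical helper (not part of tidytext or tokenizers):
# split on whitespace only, so case, "@", and "#" are left untouched,
# then paste each run of n adjacent tokens into one n-gram.
ngrams_keep_case <- function(text, n = 2L) {
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  # Too few tokens to form a single n-gram
  if (length(tokens) < n) return(character())
  starts <- seq_len(length(tokens) - n + 1L)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
    character(1)
  )
}

ngrams_keep_case("RT @Bob: loving #RStats today", n = 2L)
# "RT @Bob:" "@Bob: loving" "loving #RStats" "#RStats today"
```

Because only whitespace is treated as a delimiter, trailing punctuation such as the colon in `@Bob:` stays attached to the token, so a real analysis would probably need an extra cleaning step.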
`unnest_tokens` makes use of tokenizers to do its heavy lifting... And in said project there is tokenize_tweets.R – Puggree
`tokenize_ngrams <- function(x, lowercase = TRUE, n = 3L, n_min = n, stopwords = character(), ngram_delim = " ", simplify = FALSE)`. There is certainly an option to not lowercase in `tokenize_ngrams`. Worst case is to patch. – Puggree
`unnest_tokens` uses `tokenize_words` to clean text: `tokenize_words <- function(x, lowercase = TRUE, stopwords = NULL, strip_punct = TRUE, strip_numeric = FALSE, simplify = FALSE) {...` I changed `strip_punct=FALSE` and ran it again, but it still doesn't work. – Ghirlandaio
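The `lowercase` argument from the signature quoted in the comments can be tried by calling `tokenize_ngrams` directly. This is a rough sketch that assumes the `tokenizers` package is installed. The sample tweet is made up for illustration. It only addresses case; whether @ and # survive depends on the word tokenizer that `tokenize_ngrams` uses underneath, which this sketch does not change.

```r
library(tokenizers)

# Made-up sample tweet for illustration
tweets <- c("Thanks @Bob for #RStats")

# lowercase = FALSE asks the tokenizer to keep the original case;
# n = 2L requests bigrams (n_min defaults to n)
bigrams <- tokenize_ngrams(tweets, lowercase = FALSE, n = 2L)

# One character vector of bigrams per input document
print(bigrams[[1]])
```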