Emoticons in Twitter Sentiment Analysis in R
How do I handle or get rid of emoticons so that I can sort tweets for sentiment analysis?

I am getting: Error in sort.list(y) : invalid input

Thanks

This is how the emoticons look once they come from Twitter into R:

\xed��\xed�\u0083\xed��\xed��
\xed��\xed�\u008d\xed��\xed�\u0089 
Colligan answered 1/4, 2013 at 17:25 Comment(3)
Try working with iconv(). - Aluin
And look at ?Encodings - Tatting
May I suggest you figure out what these encodings mean. The emoticon is a form of language that conveys meaning that may not be captured in the formal text language. I am not sure what you're after, but these emoticons are sentiment: a way of representing gesture or facial expression that typical formal language may not afford. Again, use the comments and solutions here not to eliminate the emoticons but to figure out what meaning each emoticon conveys. - Jack
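
As a rough sketch of the diagnostic step the comments above point at, the snippet below checks which strings are well-formed UTF-8 before anything is sorted. The vector `tweets` and its contents are made up for illustration (they are not the asker's data), and the idea that malformed bytes are what trips sort.list() is only an assumption based on the garbled output shown in the question.

```r
# Hypothetical sample data: two well-formed strings and one with a stray byte
tweets <- c("plain ascii text",
            "caf\u00e9 con leche",
            "broken \xff byte")

# Encoding() reports the declared encoding mark of each element
marks <- Encoding(tweets)

# validUTF8() (base R >= 3.3.0) flags elements that are not well-formed UTF-8
ok <- validUTF8(tweets)
print(ok)

# Keep only the well-formed strings before sorting
clean <- sort(tweets[ok])
print(clean)
```

Rows flagged as invalid can then be inspected, repaired with iconv(), or dropped, depending on whether the emoticons matter for the analysis.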

This should get rid of the emoticons, using iconv as suggested by ndoogan.

Some reproducible data:

library(twitteR)
# note that I had to register my twitter credentials first
# here's the method: https://mcmap.net/q/667217/-twitter-roauth-and-windows-register-ok-but-certificate-verify-failed/1036500
s <- searchTwitter('#emoticons', cainfo="cacert.pem") 

# convert to data frame
df <- do.call("rbind", lapply(s, as.data.frame))

# inspect, yes there are some odd characters in row five
head(df)

                                                                                                                                                text
1                                                                      ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania  ;-)
2 “@teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons &amp; \nall the other stuff i cant see on android!" \n#Emoticons
3                      E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4                                                #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5  I use emoticons too much. #addicted #admittingit #emoticons <ed><U+00A0><U+00BD><ed><U+00B8><U+00AC><ed><U+00A0><U+00BD><ed><U+00B8><U+0081> haha
6                                                                                         What you text What I see #Emoticons http://t.co/BKowBSLJ0s

Here's the key line that will remove the emoticons:

# Clean text to remove odd characters: sub = "" drops every byte that
# cannot be represented in ASCII (iconv is vectorized, so sapply is not needed)
df$text <- iconv(df$text, "latin1", "ASCII", sub = "")

Now inspect again to check that the odd characters are gone (see row 5):

head(df)    
                                                                                                                               text
1                                                                     ROFLOL: echte #emoticons [humor] http://t.co/0d6fA7RJsY via @tweetsmania  ;-)
2 @teeLARGE: when tmobile get the iphone in 2 wks im killin everybody w/ emoticons &amp; \nall the other stuff i cant see on android!" \n#Emoticons
3                     E poi ricevi dei messaggi del genere da tua mamma xD #crazymum #iloveyou #emoticons #aiutooo #bestlike http://t.co/Yee1LB9ZQa
4                                               #emoticons I want to change my name to an #emoticon. Is it too soon? #prince http://t.co/AgmR5Lnhrk
5                                                                                 I use emoticons too much. #addicted #admittingit #emoticons  haha
6                                                                                        What you text What I see #Emoticons http://t.co/BKowBSLJ0s
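
To try the same call without Twitter credentials, here is a minimal sketch on a hand-made vector. The two strings in `tweets` are hypothetical; the iconv() arguments are the ones used in the answer. Note that the conversion to ASCII removes every non-ASCII character, so accented letters disappear along with the emoji.

```r
# Hypothetical tweets: one with an emoji, one with an accented letter
tweets <- c("I use emoticons too much \U0001F62C haha",
            "Afganist\u00e1n")

# Same arguments as in the answer, applied to the whole vector at once
cleaned <- iconv(tweets, "latin1", "ASCII", sub = "")
print(cleaned)

# Check that only printable ASCII characters remain
all_ascii <- all(grepl("^[ -~]*$", cleaned))
print(all_ascii)
```

If the text is in a language that relies on accented characters, this side effect is worth keeping in mind before applying the conversion to a whole data set.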
Kinross answered 2/4, 2013 at 0:29 Comment(2)
Ben, thank you so much, that cleaned it up. Finally! - Colligan
You're welcome! In case you're not familiar, you should upvote if the answer was useful to you (that's the preferred way to say thanks here) and click on the tick (under the up/down arrows) to indicate that it was the best answer to your question. That will be helpful to other people who have the same question as you (this process is more relevant when there are multiple answers; in this case it's more for the fun of it). - Kinross

I recommend the function ji_replace_all(string, replacement)

from the emo package, which is installed from GitHub:
devtools::install_github("hadley/emo")

I needed to remove the emojis from tweets written in Spanish. I tried several options, but some of them mangled the text. This one, however, works perfectly:

library(emo)

text <- "#VIDEO 😢💔🙏🏻,Alguien sabe si en Afganistán hay cigarro?"

ji_replace_all(text, "")

Result:

"#VIDEO ,Alguien sabe si en Afganistán hay cigarro?"

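If installing emo from GitHub is not an option, a base-R approximation is sketched below. This is not how emo works internally: it simply deletes code points from a few Unicode blocks where most emoji live. The list of blocks is an assumption and is not a complete emoji definition, and `strip_emoji` is a hypothetical helper name.

```r
# Hypothetical helper: delete characters from a few emoji-heavy Unicode blocks.
# The ranges are an approximation, not a complete emoji definition.
strip_emoji <- function(x) {
  emoji_class <- paste0(
    "[",
    "\\x{1F000}-\\x{1FAFF}",  # pictographs, emoticons, skin-tone modifiers
    "\\x{2600}-\\x{27BF}",    # miscellaneous symbols and dingbats
    "\\x{FE0F}\\x{200D}",     # variation selector-16 and zero-width joiner
    "]"
  )
  gsub(emoji_class, "", x, perl = TRUE)
}

# Same sentence as above, with the emoji written as Unicode escapes
text <- "#VIDEO \U0001F622\U0001F494\U0001F64F\U0001F3FB,Alguien sabe si en Afganist\u00e1n hay cigarro?"
result <- strip_emoji(text)
print(result)
```

Unlike a conversion to ASCII, this keeps accented letters such as the one in "Afganistán", which matters for Spanish text.
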
Boastful answered 27/8, 2021 at 8:57 Comment(0)

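You can use a regular expression to keep only the words that contain at least one letter or digit and drop everything else (for example, tokens that consist solely of emoticons). Sample code:

rmNonAlphabet <- function(str) {
  # split the string into space-separated words
  words <- unlist(strsplit(str, " "))
  # indices of the words that contain at least one ASCII letter or digit
  in.alphabet <- grep("[a-z0-9]", words, ignore.case = TRUE)
  # glue the surviving words back together
  paste(words[in.alphabet], collapse = " ")
}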
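
To see what this word filter keeps and drops, here is a small usage sketch. It defines the function with the character class `[a-z0-9]` (a `|` inside square brackets is matched literally, so it is left out here), and the input strings are made up for illustration.

```r
# Word-level filter: keep words that contain at least one ASCII letter or digit
rmNonAlphabet <- function(str) {
  words <- unlist(strsplit(str, " "))
  keep <- grep("[a-z0-9]", words, ignore.case = TRUE)
  paste(words[keep], collapse = " ")
}

# Hypothetical input: an emoji and a text emoticon as stand-alone tokens
example <- "haha \U0001F62C ok :-) 42"
result <- rmNonAlphabet(example)
print(result)

# An emoji glued to a word is NOT removed, because the whole word is kept
glued <- rmNonAlphabet("great\U0001F62C day")
print(glued)
```

Because the filter works on whole words, text emoticons such as ":-)" are dropped as well, while an emoji attached directly to a word survives. A character-level approach such as iconv() is needed for the latter case.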
Climber answered 31/7, 2015 at 8:53 Comment(0)
