I'm trying to clean up some text that was loaded into memory using readLines(..., encoding='UTF-8').
If I don't specify the encoding, I see all kinds of strange characters like:
> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> 😜ðŸ˜â˜º'"
This is what it looks like after readLines(..., encoding='UTF-8'):
> "The way I talk to my family......i would get my ass beat to
> DEATH....but they kno I cray cray & just leave it at that
> \xf0\u009f\u0098\u009c\xf0\u009f\u0098\u009d☺"
You can see the Unicode escape sequences at the end: \u009f, \u0098, etc.
I can't find the right command and regular expression to get rid of these. I've tried:
gsub('[^[:punct:][:alnum:][\\s]]', '', text)
I also tried specifying the Unicode characters directly, but I believe they're getting interpreted as literal text:
gsub('\u009', '', text) # Unchanged
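For comparison, here is a minimal sketch of one approach that might work, assuming the goal is simply to drop everything that is not plain ASCII. It uses base R's iconv() with sub = "" so that bytes with no ASCII equivalent are removed instead of escaped. The sample string and the name ascii_only are made up for illustration, and the sketch assumes the input is valid UTF-8, which may not hold for the real data shown above.

```r
# Hypothetical sample: ASCII text followed by the four UTF-8 bytes of one emoji.
text <- "just leave it at that \xf0\x9f\x98\x9c"
Encoding(text) <- "UTF-8"  # assumption: the bytes really are UTF-8

# Convert to ASCII, replacing every non-convertible byte with the empty string.
ascii_only <- iconv(text, from = "UTF-8", to = "ASCII", sub = "")

# Remove the trailing space left behind where the emoji used to be.
ascii_only <- trimws(ascii_only)

print(ascii_only)
```

This removes all non-ASCII characters, including legitimate ones such as accented letters, so it only fits if that loss is acceptable.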